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Abstract 


Accurately estimating the probabilities of pairwise-alignments (PWAs) of ancestral and de¬ 
scendant sequences is essential when aiming for precise evolutionary analyses on homologous 
[i.e., ancestor-sharing) sequences. Doing so in the presence of realistic insertions/deletions, 
however, has been hampered by many formidable technical challenges. Recently [1], based 
on the theoretical formulation dealing with stochastic models of sequence evolution with 
realistic insertions/deletions [2, 3], we invented an algorithm to compute practically exact 
probabilities of insertion-type gaps alone and those of deletion-type gaps alone. However, 
accurate computation of the hnal type of gapped segments (called case-(iv)), in each of 
which intersion-type gaps and deletion-type gaps coexist and adjoin each other, has been 
left unresolved as the ’’last piece of the puzzle” of accurately estimating the probabilities of 
ancestor-descendant PWAs. 

Here, we construct a new perturbation method to provide this ’’last piece of the puzzle”, 
that is, to accurately, and systematically, compute the probabilities of case-(iv) gapped 
segments. In short, this new method classihes (or ’’colors”) the sites in the sub-sequence 
evolution in each case-(iv) segment into ’’ancestral” and ’’descendant” types; then it considers 
only those insertions/deletions which change the coloring-pattern of the sub-sequence as 
’’perturbations”; and hnally it computes the probabilities of the contributing evolutionary 
histories of the sub-sequence from the lowest-order perturbation terms upward. Our in silico 
’’experiments” indicated that this new method computes probabilities quite accurately, even 
with only 2nd- and 3rd-order terms. 

Combining the results of this study and of a previous work of ours [1], we should 
now be able to estimate the probabilities of ancestor-descendant PWAs quite accurately. 



Thus, this study represents a signihcant step toward the ultimate goal of precise evolu¬ 
tionary analyses on homologous sequences. The method reported here has been imple¬ 
mented in an open-source package of prototype Perl scripts and modules, named ”LAST- 
PIECE(_P)”, which is available at the FTP repository of the ANEX project in Bioinformat¬ 
ics.org (https://www.bioinformatics.org/ftp/pub/anex/). 

[Keywords: pairwise sequence alignment (PWA), probability, evolution, (stochastic) 

sequence evolution model, insertion/deletion (indel), accurate computation, perturbation 
method, DNA sequence, biological sequence ] 

[Abbreviations: hidden Markov model (HMM), multiple sequence alignment (MSA), pre¬ 
served ancestral site (PAS), pairwise sequence alignment (PWA), Thorne-Kishino-Felsenstein 
(TKF) ] 
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1 Introduction 

1.1 Background 

Aligning homologous, i.e., ancestor-sharing, sequences is a cornerstone of homology-based 
analyses of bio-molecular sequences, such as DNA, RNA, and protein sequences {e.g., [4, 5, 
6, 7, 8, 9]). Traditionally, a sequence alignment has been represented in the form of a matrix, 
in which each row represents a sequence and each row, often called a ’’site”, represents a 
set of residues descended from a common ancestral residue {e.g., [4, 10]); when a sequence 
lacks a residue, a gap (usually denoted by a dash, ”-”) is placed in the corresponding cell. 
Depending on the number of sequences involved, there are two major categories of sequence 
alignments: a pairwise (sequence) alignment (PWA) that consists of two sequences {e.g., 
[11, 12]), and a multiple sequence alignment (MSA) that consists of three or more sequences 
{e.g., [13, 14, 4]). In principle, the alignment of some homologous sequences is a product of 
the evolution of the sequence, and should be determined uniguely by the evolutionary events 
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such as insertions and deletions (often referred collectively as ’’indels”).^ ’ ^ 

The inevitable serious problem is that the nature will never give us such a true sequence 
alignment as it is; all we are left with is a set of extant sequences consisting only of residues, 
i.e., with all gaps removed. This means that we have to infer, or reconstruct, the alignment 
from the extant sequences, by placing gaps so that they will collectively represent some 
plausible (and hopefully close to true) evolutionary history of the sequences. At least up 
to now, the prevalent practice is to (attempt to) hnd an alignment that optimizes a score 
function prescribed by certain rules {e.g., [11, 12, 24, 13, 14, 4, 25, 26, 27, 10, 28]). 

Unfortunately, such reconstruction of sequence alignments has turned out to be a very 
tough, error-prone, process; some recent studies on the sequence alignment errors indicated 
that a considerable fraction, often even a majority or a near-majority, of gaps are erroneous 
[e.g., [10, 29, 30]). Regarding the causes of such alignment errors, some studies revealed 
that the inherent stochasticity of the sequence evolution plays an important, or dominant, 
role {e.g., [31, 32, 30]). In other words, the true sequence alignment very frequently is not 
the optimum, because the evolutionary processes are stochastic.^ These studies suggest that 
merely hnding a single optimum alignment is not enough, and that the most truthful way 
should be to present a probability distribution of alignments inferred from the set of extant 
sequences. 

Actually, the idea of providing the probability distribution of alignments, often referred to 

^ An insertion of a residue creates a column with gaps in all sequences except the offsprings of the ancestral 

sequence that underwent the insertion. A deletion of a residue creates a column with gaps in the offsprings 

of the ancestral sequence that experienced the deletion. 

^In this study, we deal with codmear sequence alignments; an alignment is called ’’collinear” {e.g., [15]) if it 

is devoid of genomic rearrangements such as inversions, duplications, and translocations {e.g., [16, 17, 18, 19]). 

(Possibly non-collinear) alignments of (usually very long) sequences that possibly underwent such genomic 

rearrangements are called ’’genome alignments”, and they are not the subject of this study. Readers who 

are interested in genome alignments should refer, e.g., to: [15, 20, 21, 22, 23]. 

^In fact, this is the case even with the ideal score of the log-probability under the stochastic evolution 

model that actually created the true alignment. 
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as ” statistical alignmenf [33], is not new. As far as we know, the first attempt to compute the 
probabilities of pairwise sequence alignments and use them as scores was made in the ground¬ 
breaking work (in 1986) by Bishop and Thompson [34], who used a simple probabilistic 
model to assign ’’probabilities” to gaps and gapless columns . Then, in another ground¬ 
breaking work (in 1991), Thorne, Kishino and Felsenstein proposed an evolution model of 
biological sequences allowing only single-residue insertions/deletions (indels) (TKF91, in 
[35]). Then, they further attempted to ’’inch toward” the realistic sequence evolution, by 
proposing an evolution model that permits multi-residue indels of some sorts by considering 
a biological sequence as a sequence of fragments, each of which consisting of one or more 
residues, and by allowing only single-fragment indels (TKF92, in [36]). The computation of 
alignment probabilities under these TKF models were later cast into standard Hidden Markov 
Models (HMMs) {e.g., [33, 37]), which were easier to handle than the original computational 
procedures derived by TKF [35, 36]. Although standard HMMs have flaws similar to those 
of the TKF91 & TKF92 models, such as the inability to faithfully take account of nested 
and/or overlapping indels, there were still some attempts to incorporate some effects of 
overlapping indels into standard HMMs {e.g., [37]). Meanwhile, as an attempt to handle 
more realistic sequence evolution models, Miklos and Toroczkai [38] proposed a method 
to compute alignment probabilities under an evolution model that allows any lengths of 
insertions (with geometrically distributed rates) but only single-residue deletions. 

The evolution models handled up to then were dissatisfactory in the sense that their 
insertion/deletion processes are not realistic, in at least two ways: (1) most of them permit 
only single-residue indels or multi-residue indels with the geometrically distributed rates, 
whereas many empirical studies indicate that multi-residue indels occur guite often, and 
with the rates following power-law {e.g., [39, 40, 41, 42, 43, 44, 45])^; and (2) most of them 
(or all except [38] and [37]) cannot take account of nested and/or overlapping indels, which 

^Some studies based on standard HMMs alleviated this flaw {not fully but to some extent) by using mixed 
geometric distributions {e.g., [25, 31]). 
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are expected from natural evolution processes of biological sequences. 

In our view, a satisfactory method was first proposed in the milestone work (in 2004) by 
Mikos, hunter, and Holmes [2], Their ’’long indel” model is a space-homogeneous® genuine 
sequence evolution model ^ of a biological sequence that can in principle incorporate any 
indel length distributions, including biologically realistic power-laws. This study [2] made a 
couple of new achievements: (i) they verbally proved that, under the ’’long indel” model 
the probability of an ancestor-descendant PWA® can be factorized into the product of the 
probabilities of ” chop zones”, each of which is delimited by a gapless column (or the left-end 
of the alignment) and the next gapless column (or the right-end of the alignment), excluding 
the former column and including the latter one; (ii) after arguing that the probability of 
each chop zone should be approximated well by the summation of the probabilities of indel 
histories with less than an arbitrary number of indels which are shorter than an arbitrary 

®A sequence evolution model is called ’’space-homogenous” if its rates of evolutionary events are uniform, 

i.e., independent of the positions, along the sequence. 

sequence evolution model is regarded as genuine only when it follows the evolutionary principle; The 

evolutionary principle requires that, when a certain time-interval is divided into two or more sub-time- 

intervals, the probability of each evolution process of a sequence during the time-interval must result from 

multiplying the probabilities of sub-processes of the sequence as a whole during all these sub-time-intervals. 

In other words, the evolutionary principle dictates that the probability of each evolution process can be 

factorized vertically ( i.e., along the time-interval), in contrast to horizontally {i.e., along the sequence) as 

HMMs do. In the context of a continuous-time Markov model, the evolutionary principle, in conjunction with 

the ’’completeness” of the set of (intermediate) states, leads to the famous Chapman-Kolmogorov equation, 

which in turn is equivalent to the defining equation of the hnite-time transition probability operator {e.g., 

Eq. (R3.18) in [3]). 

®This factorization of the PWA probability does not necessarily work in a genuine sequence evolution 

model in general. A previous study of ours [3] provided a set of conditions under which the alignment 

probability in a genuine sequence evolution model is factorable. It is exactly because the ’’long indel” model 

satisfies these conditions that the alignment probability in the model is factorable. 

^Although they [2] discussed simply a PWA of two sequences in general, it is actually equivalent to an 

ancestor-descendant PWA, because the ’’long indel” model is time-reversible. Here we specifically put forth 

an ancestor-descendant PWA because it is the main focus of this study. 



length (finite trajectory approximation), they provided an algorithm to recursively calculate 
the probability of each individual indel history (referred to as the ’’trajectory likelihood”); 
(iii) they provided a dynamic programming algorithm of O(L^) time-complexity, where L is 
the size of the sequence lengths, to compute the joint probability of two sequences from the 
probabilities of the chop zones; (iv) they also came up with an algorithm (a sort of Viterbi 
algorithm [46]) to search for an alignment with the largest probability under the ’’long indel” 
model; although they did not explicitly show it, they used it to validate their method using 
a set of structural sequence alginments (HOMSTRAD [47]). 

Since the aforementioned milestone work by by Mikos, Lunter, and Holmes [2], more than 
a decade had passed without any particular advances regarding the application of the genuine 
sequence evolution models to the analytical or deductive studies, although some advances had 
been made on the study using the simulators of genuine sequence evolution {e.g., [48, 49, 50]). 
It was our previous studies [3, 1, 30] that made some advances for the first time since [2] in 
the deductive study of statistical alignment under the genuine sequence evolution models. 
One of our major achievements was to answer the question on the factorability of alignment 
probabilities [3]. More precisely, after defining a ’’gapped segment” as delimited by a gapless 
column and the next gapless column (similarly to the definition of the aforementioned ” chop 
zone”) but excluding both, and we first reformulated the factorability problem so that we will 
ask whether the alignment probability (under general, genuine sequence evolution model) 
can be factorized into the product of an overall factor and the contributions from the gapped 
segments. Then, we provided the conditions on the indel rates and the exit rate under which 
the alignment probability is indeed factorable, for ancestor-descendant PWAs and for MSAs. 

Another major achievement was to provide concrete methods to calculate the contribution 
from each gapped segment [1]. The ’’long indel” paper [2] did clearly provide methods 
to compute alignment probabilities and the joint probability of a sequence pair using the 
probabilities of chop zones as a building block, as well as an algorithm to compute the 
probability of each individual indel history that will contribute to the probability of a chop 
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zone. In that paper ([2]), however, it was unclear how the probability of each chop zone was 
computed. In a previous paper of ours [1], we addressed this problem. First, we classihed the 
gapped segments (, each of which in an ancestor-descendant PWA is simply a chop-zone with 
the gapless column on its right removed,) into four major categories, which we referred to as 
case-(i), -(ii), -(iii), and -(iv) segments: a case-(i) ’’gapped” segment contains no ancestral or 
descendant residues; a case-(ii) segment contains some ancestral residues but no descendant 
residues; a case-(iii) segment contains some descendant residues but no ancestral residues; 
and a case-(iv) segment contains some ancestral residues and some descendant residues, 
and none of them are homologous to one another. Then, applying either of the two 

dehning integral equations of sequence evolution (Eqs.(R4.4) &(R4.5) in [3]) to each gapped 
segment, we derived the algorithms to numerically compute ’’practically exact” probabilities 
of (the gap conhgurations of) case-(i), (ii), and (iii) gapped segments [1]. Because of technical 
difficulties, however, we were not able to give an algorithm to compute practically, or almost, 
exact probabilities of case-(iv) gapped segments; so we settled for providing methods to 
compute all contributions from parsimonious and next-to-parsimonious indel histories that 
are responsible for each case-(iv) segment [1]. 

So, combining the results of Mikos, hunter, and Holmes [2] and our previous results [3, 1], 
we are now almost in a position to compute ’’practically exact” probabilities of ancestor- 
descendant PWAs: we now know that ,under a genuine sequence evolution model satisfying 

^^Following the viewpoint of [2], we don’t care about the relative order between the ancestral and descen¬ 
dant residues in each case-(iv) gapped segment,, because we consider an alignment as a homology structure 
[51, 52], in which the order makes sense only among residues in the same sequence(, and in which homologous 

residues are vertically aligned to each other, of course). 

^^In terms of gap-configurations, a case-(i) ’’gapped” segment contains no gaps, a case-(ii) segment contains 

only ’’deletion-type” gaps in the descendant sequence, a case-(iii) segment contains only ”insertion-type” 
gaps in the ancestral sequence, and a case-(iv) segment contains both ’’insertion-type” gaps in the ancestral 
sequence and ’’deletion-type” gaps in the descendant sequence, which are adjoining each other, i.e., not 
mediated by any gapless columns. 
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the factorability conditions, the alignment probability is factorized into the product of an 
overall factor and the contributions from gapped segments; and we already have algorithms 
to compute practically exact probabilities of case-(i), (ii) and (iii) gapped segments; we are 
now left with only one issue, which is computing the practically (or nearly) exact probabilities 
of case-(iv) gapped segments. The purpose of this study is to attempt to provide this last 
piece of the ’’puzzle” of computing the probability of ancestor-descendant PWAs. Once 
this ’’puzzle” is solved, it will be relatively easy to apply the solution to a wide range 
of problems, such as the construction of probabilistic ancestor-descendant PWAs, as well 
as the approximate construction of probabilistic MSAs. Therefore, this ’’last piece of the 
puzzle”, if provided, will greatly enhance our ability to truthfully reconstruct the alignments 
of homologous sequences based on genuine sequence evolution models. 


In this study, we will attempt to provide the ” last piece” by constructing a new perturba¬ 
tion method, which divides the instantaneous rate operator of a genuine sequence evolution 
model into ’’base” and ’’perturbation” parts in a different manner than the old perturbation 
method previously provided [3, 1]. In the old method, the ’’base” part consisted only of the 
exit rate term and the ’’perturbation” parts contained all insertion and deletion operators 
and rates accompanying them. Therefore, the perturbation level in the old method is noth¬ 
ing other than the number of indel events constituting each indel history. In contrast, in 
the new method, even the zeroth-order probabilities may be contributed from some indel 
histories consisting of a large number of indels each. Therefore, there is a hope that, in 
this new perturbation method, even the summation of low order terms could give a good 
approximation of the exact probabilities (of case-(iv) gapped segments), and we will see that 
this is indeed the case. 
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1.2 Notes on recent developments 


After having finished this study and while preparing this manuscript, we learned that there 
have been a couple of new developments regarding this subject of accurate PWA probability 
computation [53, 54, 55]. First, in an attempt to speed up Miklos et ah’s approach to 
compute PWA probabilities [2], a simulation-based approach has been devised [53]. A key 
strategy of theirs is to decouple the stage of computing the ’’chop-zone” probabilities and 
the stage of PWA inference, by re-using the ’’chop-zone” probabilities pre-computed in the 
hrst stage. We view this development very favorably, and we hope that they will go on 
to extend their efforts to statistical MSA problems. (Although the current implementation 
seems to use only geometric length distributions, incorporating, e.g., power-law distributions 
does not seem so hard.) In principle, this simulation-based approach could also solve exactly 
the same problem that we are about to address in this paper. Nevertheless, we strongly 
believe that this deductive study of ours is still indispensable, for at least two reasons. One 
reason is that simulation-based approaches and deductive approaches are two essential and 
complementary kinds of approaches, on both of which any scientihc disciplines have been 
developed and advanced properly. It is only if these two kinds of approaches are advanced 
soundly that the study of a scientihc discipline will develop in a wholesome manner. The 
other reason is that we, human-beings, are, after all, creatures of reasoning. Every time 
a new fact is revealed, there arise strong demands for the reason(s) behind it. Generally, 
deductive studies are much better at satisfying these demands than simulation-based studies. 
Therefore, we consider it absolutely necessary to disclose this deductive study of ours now, 
even after the advent of a simulation-based approach [53]. 

Second, there have been a couple of attempts to incorporate the effects of overlapping 
indels into the framework of (pair-)HMMs [54, 55], although with the instantaneous in- 

you can see, this strategy is very similar to the strategy employed by ANEX, a program package we 
recently developed to address statistical MSA problem under genuine sequence evolution models [56]. See 
also footnote 30 below. 


12 



del length distributions being geometric. Especially in [55], an attempt has been made to 
approximate the master equation of continuous-time Markov model (with geometric indel 
length distributions) within (pair-)HMMs (or hnite-state automata). We consider that these 
attempts are laudable. At the same time, however, we hud it a pity that these studies 
conhned themselves in biologically unrealistic geometric indel length distributions. We wish 
that they could have extended their efforts to more biologically realistic power-law indel 
length distributions, or at least to mixed geometric distributions. 


1.3 Structure of This Paper 

This article is structured as follows. In Section 2, we explain the basic ideas and principles 
underlying the method proposed here. Then, in Section 3, we propose the basic framework 
based on the theory we previously provided [3, 1], which extends a space-homogeneous 
theory [2] (see also [48]) to more general situations. Based on the proposed basic framework, 
in Section 4, we construct a new ’’perturbation theory” that enables us to compute the 
probabilities of case-(iv) gap-patterns as accurately as desired. The next section (Section 5) 
unfolds the concrete computations at the 2nd- and 3rd-order levels in this new ’’perturbation 

^^Incidentally, many (or most) of the combinations of parameter values in the simulation study of [55] 
seem unrealistic for a study of homologous sequences. In our view, the product t x /i (, where t is the 
time-lapse and /i is the deletion rate,) must be less than the upper-bound of about 1/4 (« 4 x 1/16, 
where the 4 (substitutions/site) is an (extremely generous) upper-limit of the evolutionary distance with 
detectable homology (via residue configurations), and the 1/16 (deletions/substitution) is a half of the 1/8 
(indels/substitutions) estimated in a genome-wide analysis [57]); and the product should usually be much 
less than this upper-bound (be., 1/4). Otherwise, it should be extremely hard, or impossible, to detect 
homology between the ancestral and descendant sequences. For example, when t = 8 and fj, = 0.5, we expect 
that each site suffers (at least) 4 (= 8 x 0.5) deletions on average!; (actually, though, each site can suffer at 
most one deletion, because it will never come back once it is deleted); how is it possible to detect homology 
in such a circumstance? We strongly urge the researchers (especially theoretical ones) in this field to have a 
decent sense of biologically realistic scales, which will help avoid futile disputes, for example. 
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theory.” Then, in Section 6, we provide a prototype algorithm to compute the probabilities 
to the third-order level of ’’perturbation”, and demonstrate how accurate its results are. 
Finally, in Section 7, we discuss some of possible further developments and outstanding 
issues. 

For clarity, in this article, we focus only on the sequence evolution via insertions/deletions, 
by assuming that the probability of a sequence alignment (given a hxed phylogenetic tree, 
if necessary) can be decoupled into the product of the probability under the substitution 
model and the probability under the insertion/deletion model. This decoupling can be done 
if the sequence evolution model satishes a set of conditions, for example, as we showed before 
[30] (via a generalization of the proof by Kim and Sinha [58] ), the following three conditions 
will suffice: (i) the indel rates (excluding the multiplication factors assigned to the inserted 
residues) are independent of the residue state and the substitution process before the indel 
event; (ii) the substitution rates at each site are independent of the states and the evolu¬ 
tionary processes at other sites; and (iii) the probability of the residue state of each inserted 
sub-sequence (conditioned on the insertion) can be factorized into the product of residue 
probabilities (at the time) over the inserted sites. This focusing on insertions/deletions 
signihcantly simplihes the problem at hand. 

Then, we also assume that the indel evolution model we deal with here satishes the 
’’sufficient and nearly necessary” set of conditions for factorable probabilities of ancestor- 
descendant PWAs [3], namely: (1) the rate of each (indel) event is independent of the 
region(s) outside of the site(s) it affect; and (2) the increment of the exit rate by each indel 
event is independent of the region(s) outside of the site(s) it affect. 


2 Underlying ideas and principles 

In a previous work of ours [1], we provided a pair of algorithms to compute the probabilities 
of isolated gaps, he., case-(ii) gapped segments having only ancestral sites and case-(iii) 
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gapped segments having only descendant sites. As a matter of fact, these probabilities were 
relatively easy to compute, because all the sites under consideration (in between a pair of 
preserved ancestral sites (PASs) belong either only to the ancestor or only to the descendant; 
in the former case, those sites are destined to be deleted eventually, and in the latter case, 
NONE of those sites existed at the initial time (i.e., in the ancestral sequence). This means 
that, when dealing with each case-(ii) or -(iii) gapped segment, we do NOT need to keep track 
of the origins of the sites (in between the PASs) during the evolution, and this effectively 
enabled us to focus only on the evolution of the number of those sites (or, equivalently, the 
length of the region). 

In contrast, each case-(iv) gapped segment consists of some ancestral sites and some 
descendant sites that are not homologous to any ancestral ones. Therefore, when considering 
the evolution of the region in question, we need to keep track of the origins of the sites, that 
is, whether each site existed from the beginning (of the time-interval) or was newly inserted 
at some point. This fact makes it much more difficult to compute the probabilities of of 
case-(iv) gap-patterns than to compute the probabilities of case-(ii) and -(iii) gap-patterns. 

Nevertheless, by carefully considering the situations in question, as well as the nature of 
insertions and deletions, the problem could be surprisingly simplihed, as we explain in the 
following. 

Let us consider a general indel history that results in a case-(iv) gapped segment {e.g., 
Figure 1). First, when focusing on the segment alone in a PWA, it consists of three kinds of 
ingredients: (i) two preserved ancestral sites (PASs), where an ancestral site is aligned with 
a descendant site, that flank the segment (ii) some ’’ancestral sites” that were deleted at 
some point in the time-interval; and (iii) some ’’descendant sites” that were inserted at some 
point in the time-interval. Next, when considering the indel history as well, there may also 
be (iv) ’’evanescent” (or ’’transient”) sites, which were inserted at some point and deleted 

Strictly speaking, these PASs do not belong to the gapped segment. However, when computing the 
probability, we will consider that the segment includes the PASs. 
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Figure 1: Time-trajectories of ancestral (red), descendant (cyan), and evanescent (yellow) sites 
in example indel history that results in case-(iv) gapped segment. Sequence states at different time- 
points, labelled sq, si,S 7 , are aligned just as in a normal multiple sequence alignment (MSA). Sites sharing 
the same ancestry are aligned vertically. The ’—’ in a white cell indicates that the sequence lacked the site 
at that time. The pair of blue dashed lines diverging from an ”X” represents an insertion. The pair of red 
dashed lines converging to an ”X” represents a deletion. The ”1, 2, ..., 9, A, ..., F” in the red cells are the 
ancestry indexes of the ancestral sites.The ancestry indexes of the inserted sites (cyan and yellow boxes), Vj 
(with j = 1 , 2 , ... 8 ), were assigned in the temporal order of creation. 


at some later point. 

For the moment from now, let us ignore the ’’evanescent” sites (as in Figure 2). When 
constructing a PWA, especially when the homology structure alone is important, we could 
completely segregate ancestral sites from descendant sites. When considering the time- 
trajectories of those sites, however, we observe that the ancestral sites and descendant sites 
were actnally intermingled with one another, as shown in Figure 2. This is because the 
sites in a (snb-)seqnence at each time-point must follow a strict spacial order; site A on 
the left of site B must NEVER be placed on the right of B. Taking another look at the 
time-trajectories of the sites (e.g., in Figure 2), we notice that, at each time-point, some 
ancestral sites are Inmped together against the backgronnd of surronnding descendant sites, 
to form a ’’cluster”; similarly, some descendant sites form a ’’cluster” against the backgronnd 
of snrronnding ancestral sites. By introdncing symbols ”A” and ”D” to denote a ’’cluster” 
of ancestral sites and a ’’cluster” of descendant sites, respectively, we could represent the 
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Figure 2: Time-trajectories of ancestral (red) and descendant (cyan) sites, when we ignore 
evanescent sites (shaded in gray). This figure differs from Figure 1 only in how the evanescent sites 
are colored/shaded. Now, the evolution of the ” A”/”D”-coloring-pattern has become clearer. The pattern 
evolution here is: ”A ”(so & si) ^ ’’ADA” (sa) ^ ’’ADADA” ( 53 ) ^ ”DADA” ( 54 ) ^ ”DA” ( 55 ) ^ ”D” 
(se & S 7 ). The timing slightly differs from that when evanescent sites are also taken into account (Figure 
3). 


sub-sequence at a time-point as an alternating concatenation of ”A”s and ”D”s, such as 
’’ADADA”. This way, the indel history of the sub-sequence could be broadly represented by 
a ’’time-series”, such as, ”A —)■ ADA —)■ ADADA —)■ DADA —DA —)■ D” (Figure 2). 

Now, let us incorporate the evanescent sites again, and pay attention to their surroundings 
(Figure 3). First, an evanescent site, or a lump of contiguous evanescent sites, is regarded 
as belonging to a ”D”-region, if any of the following is satished: (i) it was inserted within, 
or at the end of, a (lump of) descendant site(s) {e.g., vj in Figure 3); (ii) it was inserted in 
conjunction with one or more descendant site(s) {e.g., in Figure 3); or (iii) it spawned 
one or more descendant site(s) within, or at the end of, it {e.g., Vi in Figure 3). Second, 
the evanescent site(s) will not be regarded as either ”A” or ”D”, if they were created after 
the deletion of all ancestral sites and before the hrst creation of descendant sites. (These 
sites can still be incorporated into our framework without any problem. See below.) Then, 
each of the remaining evanescent sites is regarded as belonging to an ” A”-region, which 
should be uniquely determined at the time of the site’s creation (the surrounding of e.g., 
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Figure 3: Evolution of ”A”/”D”-coloring pattern of region in between PASs (”L” and ”R”), 
after taking account of evanescent sites again. In each sequence state, the dashed-bordered transparent 
rectangles colored in red and blue represent the ” A”-region and the ”D”-region, respectively. The pattern 
evolution here is: ”A ”(so) ^ ’’ADA” (si & sa) ^ ’’ADADA” (sg) ^ ”DADA” ( 34 ) ^ ”DA” ( 55 ) ^ ”D” 
(sg & S 7 )■ The timing slightly differs from that when evanescent sites are ignored (Figure 2). This figure 
was created by merely adding the ”A”- and ”D”-region indicators to Figure 1. 


in Figure 3). An important corollary of these rules is that a (lump of) evanescent site(s) 
always belongs to a ”D”-region if it was inserted at the boundary of an ”A”-region and the 
”D”-region {e.g., Vj in Figure 3); this will play an important role below, when dehning the 
’’base” rate operator acting on each region. 

Following the rules prescribed in the previous paragraph, the ”A/D-coloring pattern 
evolution”, or the ’’broad time-series”, is nearly uniquely assigned to each indel history 
resulting in a case-(iv) gapped segment (Figure 3). This means that we may accurately 
compute the probability of each case-(iv) gapped segment as follows: (i) classify all the 
indel histories resulting in a given segment into the ’’broad time-series”, (ii) compute the 
probability contributed by (all the histories belonging to) each ’’broad time-series”, and 
(iii) sum all the probabilities computed as in (ii). In practice, however, the ”A/D-coloring 
pattern” could be quite complex; for such complex cases, the probability computation might 
be too time-consuming to be practically feasible unless some smart time-saving algorithms 
are invented. Therefore, for the moment, it would be wise to deal with the problem via a 
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Figure 4: Example indel history with ” A/D”-coloring ambiguity. The same notations as in Figure 
3 are used here. The indel histories on the left and right are in fact identical to each other; the ancestral sites 
with indexes 3 and 4 and the descendant site with index cannot be uniquely ordered, because they are 
not homologous to each other, and because there are no sites determining their relative order. In this indel 
history, two lumps of evanescent sites, [^ 1 ,^ 2 ] and [^ 3 ,^ 4 ], were created. Depending on which of the lumps 
incorporates the descendant site (U 5 ), there are two ways of coloring the evanescent sites in ”A” (transparent 
red) and ”D” (transparent cyan), as shown here. 


sort of ’’perturbation method”,where the probabilities will be computed starting from the 
simplest patterns of ”A/D-coloring pattern evolution” and gradually moving on to complex 
patterns. 

In the following couple of sections, we will concretely construct a new ’’perturbation- 
method” that embodies the computation strategy explained above. 


^^As a matter of fact, there could be an ambiguity on the ” A/D”-coloring of evanescent sites, (probably 
only) in an exceptional type of indel histories, which satisfy the following conditions (see Figure 4 for an 
example): (1) two lumps of evanescent sites were independently created in an ”A”-region; (2) then, the sites 
in between the two lumps were deleted; (3) then, some descendant sites were created exactly at the boundary 
between the two lumps; and (4) no other descendant sites were created within, or at the end of, the two 
lumps of evanescent sites. In such indel histories, either of the two lumps could be colored ”A”, and each 
such indel history could be counted twice in the perturbation method unfolded below. This means that some 
correction should be done in order to rectify such double-counting and obtain perfectly exact probabilities. 
Fortunately, each such indel history inevitably involves at least six insertions/deletions (ie., an insertion- 
deletion pair for each lump of evanescent sites, a deletion (of intervening ”A”-sites), and an insertion (of the 
descendant sites)) that are in extremely exquisite spatial relationships to each other. Therefore, the total 
contributions from such indel histories are expected to be negligible in most practical analyses. 
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3 Basic Framework 


We concretize the above computation strategy founded on the theoretical formulation we 
provided before [3, 1], which can handle stochastic sequence evolution models generalizing 
Miklos et ah’s ”long-indel model” [2] (and also of the model used by a genuine sequence 
evolution simulator, Dawg [48]). For the problem at hand, i.e., regarding a case-(iv) gapped 
segment, it is sufficient to consider a sub-sequence (or a region) consisting of the two flanking 
preserved ancestral sites (PASs), labelled as ”L” (for left) and ” R” (for right) hereafter, and 
the sites in between them, labelled as 1, 2,..., AL{s), where the AL{s) is the number of sites 
in between the PASs and it varies while the sequence state, s, evolves. Aside from these 
labels, each of the sites, x = L,l, 2,..., AL{s), R is also assigned an ’’ancestry index”, denoted 
v{x) [3]. These indexes as a function of x also vary in the course of the evolution, because 
insertion(s)/deletion(s) could change the position (along the sequence) of the site with a 
particular ancestry. (Thus, if you want to make their time-dependence explicit, you could 
use the notation, v{x,t).) Since the sole role of these ’’ancestry index”es is to keep track 
of the time-trajectories of the sites involved, we could arbitrarily assign the indexes, { v{x) \ 
X = L,l, ■■■, AL{s), R }, as long as the sites of different origins carry different indexes. 
Because the sites L and R are preserved throughout the evolution under consideration, the 
ancestry indexes at these sites are also hxed. Thus, we will henceforce assign v{L) = L and 
v{R) = R. Then, the basic sequence state (concerning only insertions/deletions) is denoted 
as: 

(s I = ( [L,'i;(l),'i;(2),...,t;(AL(s)),i?] I . (1) 


(Or, 


{s{t) I = ( [L,v{l,t),v{2,t),...,v{AL{s),t),R] 


^®It should be noted, however, that each site must keep its unique ancestry index during its evolutionary 
course, from the beginning to the end. 
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to explicitly represent its time-dependence.) Next, we define the time-frame. Let the time- 
interval in question begin at ti and end at tp- Thus, we will consider that the time t always 
be within the interval, \ti , tp]- 

As described in section R3 of a previous work of ours [3], the stochastic time-evolution 
of this sequence is controlled/prescribed by the (instantaneous) indel rate operator, 
which can be decomposed as: 

= Q\t) + Q^{t), (2) 

Q^it) = QZ{t) + Qm {m = I,D). (3) 

And the actions of the components on the sequence state, Eq. 1, are: 

AL(s)H-1 oo 


s 1 

= 1 M/(a;,/) , 

x=—l 1=1 

(4) 


AL(s)H-1 +CX) 


s 1 g^(t) 

= ^ ^ rD{xB,XE;s,t){s \ Md{xb,xp) , 

(5) 


xb=—oo a;£;=max{0,tcs} 


1 Qm 

= -R'^{sR){s\ {m = I,D). 

(6) 


Here, as in [3], we tacitly assumed that the sub-sequence in question is embedded in an 
infinitely long sequence. And, for computational convenience, we replaced x = R and x = L 
with a; = 0 and x = AL{s) -|- 1, respectively, for the site numbers of the flanking PASs. 
The above equations use the same notations as in a previous work of ours [3]. Especially, 
Mi{x, 1) is the operator that inserts a string of length I between the sites x and x + and 
r/(a;, /; s, f) is the instantaneous rate that such an insertion occurs on the sequence state {s \ 
at time t; Mp,{xB,xp) is the operator that deletes the sites, x = xp through x = xp (both 
inclusive), and rp){xB,xp] s,t) is the instantaneous rate that such a deletion occurs on {s \ 
at time t. The insertion and deletion components, respectively, 

^^The difference from the equations in [3] is that the site number, x, starts with a; = 0 (and ends with 
X = AL{s) -I- 1 (= L{s) — 1, where L{s) AL(s) -|- 2 is the number of sites in the sequence state, s)) here, 
whereas it starts with a; = 1 (and ends with x = L(s)) in [3]. 
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of the exit rate, of the state {s \ at time t via insert ions/deletions. Specihcally, 

Rx^is,t) = Rxis,t) + Rx{s,t) , (7) 

AZ/(s)+l oo 

x=—l 1=1 

AL{s)-\-l +00 

Rx{sR) = X] X] rD{xB,XE-,sR) , (9) 

xb=—oo x£;=max{0,xe} 

These equations, Eq.l through Eq.9, provide the foundation for our strategy to compute 
the probability of each case-(iv) gapped segment. Using these equations, we could at least 
theoretically calculate the transition probability from a sequence state to another in between 
the PASs, L and R, just as described in a previous work of ours [3]. In general, however, such 
state transitions are not limited to those providing case-(iv) gap-patterns. To go further, we 
will resort to a sort of ’’perturbation theory”, as unfolded in the next section. 


4 New Perturbation Method to Systematically Com¬ 
pute Probabilities of Case-(iv) Gap-Patterns up to 
Desired Level of Accuracy 

4.1 Broad Account of the Procedures 

As explained in Section 2, the time evolution of a sub-sequence yielding a case-(iv) gapped 
segment follows a particular time-series of ”A(ncestral)/D(escendant)-coloring”, such as ”A 

^®The ’’perturbation method” unfolded here is somewhat similar to, yet significantly different from, the 
perturbation-theoretical approach in our previous studies [3, 1]; especially, the ’’base”-’’perturbation” de¬ 
composition of the rate operator, , differs between the two approaches. Here, the probabilities computed 
via the latter approach (so that the evolutionary histories include as much indels as desired) are regarded as 
’’exact”-solutions (under the ’’base” rate operator here), and thus the term, ’’perturbation”, always refers to 
that in the method constructed here. 
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—)■ ADA —)■ ADADA —)■ DADA —)■ DA —)■ D” (Figure 3); especially, it must always start 
with ”A” (ancestral only) and end with ”D” (descendant only). Now, we slice the broad 
evolutionary history at the times, tk with k = 1,2,Nsi — 1, when the ” A/D”-coloring 
patterns of the sequence change (Figure 5). (The Nsi above is the number of slices in the 
broad history.) Then, each slice, e.g., in an open time-interval, (tk,tk+i), shows a hxed 
”A/D”-coloring pattern, e.g., ”DADA”, in which ”A” and ”D” alternate. This suggests 
the strategy for computing the total probability that each particular time-series of ”A/D”- 
coloring occurs: (i) within each slice, decompose the rate operator Q^^it) into the ”base”- 
and ’’perturbation”-parts; (ii) within each slice, compute the probabilities of state transitions 
through each slice under (or conditioned on) its fixed ”A/D”-coloring pattern, using the 
’’base”-part of Q^^(t); (iii) between two consecutive slices, compute the probabilities of the 
’’coloring-pattern-transition” from each slice to the next, by multiplying the rates of the 
particular insertions/deletions that changes the ”A/D”-coloring pattern(, which probably 
belong to the ’’perturbation” part); (iv) stack the probabilities of the slices (via (ii)) and the 
’’coloring-pattern-transitions” (via (iii)), from the beginning to the end, and integrate over 
the times, (ti <)ti < t 2 < ... < tNgi-i{< tp), to get the probability contributed from the 
particular pattern of ” A/D”-coloring evolution to the case-(iv) gapped segment. 

4.2 ’’Base”-’’Perturbation” Decomposition of Rate Operator 

Within each slice of a hxed ” A/D”-coloring pattern, we can formally decompose the rate 
operator Q^^{t) into the ’’base”-and ’’perturbation”-parts. Hereafter in this subsection, we 
will consider a slice in the open time-interval, {tk,tk+i), and refer to it as ’’slice k” . The k 
can range from 0 through Nsi — 1, where to ti and tATg, ti? are the initial and hnal 
times, respectively, of the whole time-interval considered here. Let Nc{k) be the number 
of colored regions in slice k, and let C* (with i = 1,..., Nc{k)) be the t-th colored region, 
numbered from left to right. For example, in the above case of ”DADA” (for slice k (=4)), 
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Figure 5: Slicing ” A”/”D”-coloring pattern evolution. This figure was created from Figure 1 by 
adding the times, ti, at which the ”A/D”-coloring-pattern changes, and by removing the state labels 

{so,...,s^) for clarity. To clarify the slices, each of the time is accompanied by a triangle on the time-axis 
and a thick dashed horizontal line. 


Nc{k) = 4, Cl and C 3 represent the 1st and 2nd ”D”-colored regions, respectively, and C 2 
and (74 represent the 1st and 2nd ” A”-colored regions, respectively. Hereafter, the ”D”- and 
” A”-colored regions will be abbreviated as the ”A”- and ”D”-regions, respectively. And let 
xsii) and XEii) be the site numbers at the beginning and end, respectively, of the region Ci. 
Especially, xsii + 1) = XE{i) + 1, a;s(l) = 1 and XE{Nc{k)) = AL(t) always hold. Then, we 
define the ’’base” rate operator, for the region ( 7 *, as follows: 

Qo^{i]t) = + Q^+ Q^o^x{k,t), (10) 


XE{i)-l+o-Eii) 00 


^ Qo m(f ~ 

Yri{x,l;s,t){s \ Mi{x,l) , 

( 11 ) 


x=XB{i)-o-B{i) 1=^ 


^ Qo m(f Q ~ 

Exs Exs roixE, Xe] s, t){s 1 Md{xb, Xe) , 

( 12 ) 


{xB,XE)¥=(xB{i),XE{i)) 



-AR^^{s,Cifi){s\ . 

( 13 ) 


First note that no deletions in the deletion component, Eq.l2, stick out of the region Cf, 
thus, this ’’base” rate operator focuses on the deletions occurring totally within the region. 
Also note that Eq.l 2 does not include the deletion of an entire Ci (i.e., ME{xB{i),XE{i))', 
such a deletion will be included in the ’’perturbation” part (dehned below). In the insertion 
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component, Eq.ll, each of asii) and asii) takes only the values 0 and 1. These values are 
determined to realize the affiliation rules for the ’’evanescent” sites prescribed in Section 
2. Specihcally, asii) = o^Eii) = 1 if Q is a ”D”-region; if Ci is an ”A”-region, aBii) = 1 
still holds if i = 1, and = 1 still holds if i = Nc{t)] in the remaining cases, each 

of them equals 0. In the exit-rate component, Eq.l3, AR^^{s,Ci,t) {s,{s\Q},t) 

(=^ R^xi^it) ~ \ Ci},t)) is the increment of the exit rate caused by the existence of 

the region Ci in the sequence state s (at t). 

In fact, the insertion and deletion components of the ’’base” rate operator, Eqs.ll & 12, 
are dehned so that (5o^(f;t)’s with different i’s do not interfere with each other, provided 
that Condition (i) in section R6 of a previous study of ours [3] is satished between different 
Cj’s. Besides, the exit rates, Ai?^(s, Cj, t)’s, with different i’s are also independent of each 
other, provided that the Condition (ii) for the factorability of the PWA probabilities (in 
section R6 of [3]) is satished. If these conditions hold, we actually have: 

Ned) 

R^£{s,t)=RlP{[L,R],t)+ ^Rxis,C„t). (14) 

i=l 

Because i?^([L,R],t) is like a ’’constant” independent of the specihe state of the sub¬ 
sequence (as long as it has the PASs, L and R), it would be convenient to dehne an operator, 
(5o^(0;t), such that 

{s \ Q^,^j,{0;t) = -R^£{[L,R],t){s \ . (15) 

Then, for a positive Nc{k), we define the ’’base” total rate operator as follows: 

Nc(k) 

Ql^{t) = Qo\{0-, t)+ ^ t). (16) 

i=l 

From the above arguments, we see that the Qq^Iv, t)’s with different i’s do not interfere with 
one another. Thus, under the aforementioned conditions, they could be time-integrated 
independently from one another, to get the ’’base” hnite-time transition operator, Pg'^(t,C), 
^®Here, the {s \ CR represents the sequence state formed by removing Ci from s. 
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as follows: 


Po^{t,t') = T<(exp( I (irQ^^ir) 

Nc{k) 


pQ^{0-,t,t') X Yl , with 


ylDt 


■)IDi 


i=l 

dID, 


def 


WD/ 



(17) 

, with 


X i , 

(18) 

(with i = 1, ...,Nc{k)) . 

(19) 


Here, T {...} represents the time-ordered product of operators in which an operator at time 
r comes on the right of other operators acting earlier than r and on the left of others acting 
later than r. And the ”1” above is the identity operator. 

The ’’perturbation” part, denoted here as AQ^^(t), is simply dehned as: 




( 20 ) 


Because the ’’base” operator, Qo^(t) in Eq.l6, includes the entire exit-rate component, 
Q}c’{t)(= + QxiP)^ entire insertion-mutation component, Ql^{t), 

AQ^^{t) is purely a linear combination of the deletion operators representing deletions ex¬ 
tending into two or more regions, as well as deletions of the whole regions of single ”A”s and 
single ”D”s. As it is, the explicit expression of AQ^^{t) is considerably complex. 

Fortunately, the fact that we are now dealing only with a single case-(iv) gapped segment, 
as well as the natures of ”A” and ”D” regions, will considerably simplify the expression of 
the ’’effective part”, of the perturbation part, which collects all those deletions 

which could actually occur in the indel histories we want, as follows. First, we can ignore 
all deletions that affects L and/or R, because, by definition, the PASs, L and R, have never 
experienced deletions. Second, we can also ignore all deletions each of which deletes a whole 
”D”-region, because the ”D”-regions, also by definition, must NOT be deleted entirely. (If 
so, they are just lumps of evanescent sites.) This also means that a deletion can always 
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Figure 6: Deletions included in the ’’effective” perturbation part of the rate oper¬ 

ator. a. The deletion of an entire ”A”-region alone, b. The simultaneous deletion of an entire ”A”-region 
and parts of the flanking ”D”-regions, c. A ’’boundary-eroding” deletion. In each panel, the red and cyan 
rectangles represent an ” A”-region and a ”D”-region, respectively. 


be ignored if it deletes two or more contiguous regions, as ”A” and ”D” always alternate. 
Taking account of these two restrictions, we see that the constituent deletion operators of the 
’’effective part”, can be classihed into two categories: (i) ”A-deleting” deletions, 

each of which deletes an entire ”A”-region (Figure 6 a), and may simultaneously delete (a) 
part(s), but never the whole, of the flanking ”D”-region(s) (Figure 6 b) and (ii) ’’boundary¬ 
eroding” deletions, each of which straddles two contiguous regions, but does NOT delete any 
whole regions (Figure 6 c). These results can be expressed as follows: 


^QEffi^) — Qn-.A-deli^) + QM-.B-eri^) i 

Ncik) 

QM:A-delit) = ^(color(i),A) gM:A-del(h^) , 

i=l 

(■^ I QM:A-del(F = E E rD{xB,XE-,s,t){s I Md{xb,Xe) , 

XB=XB{i-^) + l XE=XE[i) 

Nc{.k)-l 

QM-.B-er it) = QM:B-er(F^), 

i=l 

XE(i) XEii+l)-i 

(-5 I QM:B-eritj t) = E E rE>ixB,XE;S,t){s I MoixE^XE) ■ 

XB=XB{i) + ^ XE=XB('i+l) 


( 21 ) 

( 22 ) 

( 23 ) 

( 24 ) 

( 25 ) 
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Here, the subscripts, ”A-del” and ”B-er”, are short for ”A-deleting” and ’’boundary-eroding”, 
respectively. In Eq.22, the 5(color(i), A) is an analog of Cronecker’s delta, and (5(color(i), A) = 
1 if Ci is an ”A”-region, and = 0 otherwise. In Eq.23, we dehne 0 : 5 ( 0 ) = 0 and XE^Ncik) -|- 
1) = AL{s) + 1, to cover the cases where color(Ci) = A and color (CArp(fc)) = A, respec¬ 
tively. As indicated by their dehnitions, Eqs.23 and 25, the QM:A-dei(b^) ^^e collection of 
all deletions deleting the entire ”A”-region, Ci, and the Qm-.b-bA^'^) is the collection of all 
’’boundary-eroding” deletions affecting Ci and Cj+i. 

For completeness, let us hnally consider slices with Nc{k) = 0, which could occur if all 
ancestral sites are deleted before any ”D”-regions are created. Such a slice consists only 
of evanescent sites, and it begins and ends with the sequence state, [L,R\. Therefore, the 
total probabilities for such a slice and a series of such slices (with various time-intervals) are 
exactly equal to those of case-(i) gapped segments, which can be computed using the whole 
rate-operator, Q^^it) in Eq.2, exactly as described in a previous study of ours [1] (, or, via 
a faster method identical in essence to that in subsection 4.3). 

4.3 Computing Transition Probabilities within Each Slice 

If we want to compute a transition probability within each slice, say, slice k in {tk,tk+i), we 
Erst need to specify the ’’initial” and ”£nal” lengths of the regions, Ci with i = 1,..., Nc{k) 

, at times tk + e and — e, respectively. (Here, e is an inhnitesimal 

value.) This is equivalent to specifying the ’’initial” and ”£nal” values of xeA), with i = 
1,Nc{k); let xeA^F) and xeA^F) (both with i = 1,Nc{k)) be such ’’initial” and 
”hnal” values, respectively. Similarly, let xbA',!) and xbA'^F) be the ’’initial” and ”£nal” 
values, respectively, of xbA) (with i = I, ...,Nc{k)). More generally, let xbA'A) and xeA'A) 
be the beginning and end coordinates, respectively, of the region Ci at time t in {tk,tk+i). 
(See Figure 7 for a schematic illustration of the situation considered here.) 

As in the previous subsection, we consider that Conditions (i) and (ii) in section R6 of 








^P:(3;t;,t) = P:(3;t;,t)'Q:(3;() 




Figure 7: Sequence evolution (via indels) under ’’base” rate operator, within slice k (in 

{tk,tk+i))- As before, the red and cyan rectangles represent an ”A”-region and a ”D”-region, respectively. 
Each thick, light-colored downward arrow suggests that the stochastic evolution of the corresponding region 
(Ci) is dictated by the relevant ’’base” rate operator 


a previous study of ours [3] are satisfied between the ’’base” rate operators, Qo^(f;t)’s, on 
different Cj’s. Then, in the current perturbation framework, the ”zero-th approximation” 
of the transition probabilities within slice k can be obtained by sandwiching the factorized 
form of the ’’base” transition operator, Eq.(17), with the initial and final sequence states. 
As far as this ”zero-th approximation” is concerned, we could express the sequence state, 
{s I in Eq.l, as a ’’tensor product”: 


Nc{k) 


(sw 1 

= (so 1 ® 

{s{Ci;t) \ , with 

i=l 

(26) 

(■So 1 

= ([^,^]l 

•) 

(27) 

{s{Cp,t) 1 

= {[ v { xb {^ 

■,t),t),...,v{xE{r,t),t)]\ . 

(28) 


Here, v{x,t) is the ancestry index of the x th site at time t. Hence, using this ’’tensor prod¬ 
uct” expression, the ”zero-th approximation” of the transition probability can be calculated 
formally as: 


a [b(*p+i).‘p+i) I b(y).y)] = (dC) I i 
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2 = 1 


s(C.i«j^_i)>.(29) 


To attain the second equation, we used the fact, (sq | -Po^(0; t, t') \ sq) = exp dr i?;^([L, -R], r) 

Therefore, the problem of computing the ”zero-th approximation” of the transition proba¬ 
bilities within each slice can be reduced to that of computing the multiplication factors, or 
transition probabilities within each colored region, denoted here as: 


^lP, I (s(C'i;t+),t+)] = {s{Ci]t+) \ | s(C'p4+J) , (30) 


for each Ci (with i = 1,Nc{k)). (To remember the situation under consideration, see 
Figure 7.) 

For a locally space-homogeneous model, whose insertion/deletion rates are homogeneous 
within each of most gapped segments, we derived a system of iterative equations to give 
effectively exact multiplication factors for isolated gaps. (See section SM-3 in additional 
file 1 of a previous work of ours [1].) It should be noted that, within each Ci and under 
the ’’base” rate operator, we can forget about the origins (or ancestries) of the sites in the 
region, because all of them will eventually be deleted anyway if Ci is an ” A”-region, and 
because none of them belonged to the ancestral (or initial) sequence state, anyway, if Q is 
a ”D”-region. It follows that, under a locally space-homogenous model, the state, {siCpf) |, 
depends only on the number of sites in Ci, i.e., AL{r,t) XE{i',t) — XB{i',t) + 1; we can 
thus express this fact explicitly as: {siCpt) |= {AL{i;t) |, just as in the derivation for an 
isolated gap. Moreover, AR(P{s,Ci,t) also depends only on AL{i;t) (and possibly t); let 
us remember this fact by using an ’’alias”, AR^^{AL{i-,t),t), of AR^^{s,Ci,t). Therefore, 
following almost exactly the same procedure (as in SM-3 of a previous study of ours [1]), we 

the current ’’perturbation” method, the ”zero-th” approximation does not mean that there are no 
insertions/deletions during the time-interval, (tfcTT+i)! actually, there could be any numbers of inser¬ 
tions/deletions, as long as they occur totally within each of the regions (C^’s) and as long as they do 
not delete any of the entire Cds. 
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could derive a system of iterative equations also for effectively exact multiplication factors 
for each C* in slice under a locally space-homogenous model. 

The only differences from the previous method [1] are: (i) the exit rate used here is 
the increment, AR^^{s,Ci,t) (see the description below Eq.l3), instead of the whole rate, 
R^jP{s,t) (in Eq.7); (ii) both initial and hnal states have nonzero sites in between L and R, 
whereas, for an isolated gap, either of them has no sites in between; and (iii) the ’’effective” 
rate operator differs slightly, specihcally, the one used here lacks the deletion of the entire Q, 
and it may also lack insertions at both or either end(s) if C* is an ”A”-region. (See appendix 
A for the specihc computation along this line.) 

Here, however, we provide a much faster approach; it is based on a direct approximation 
of the following dehnition: 

pQ^{i;tF,ti) = T jexp dr Ql^{i-T) | 

lim (i + A^pt ■ ■ ■ ■ (i +A^pt ■ Ql^{i]fNp)) , (31) 

Np^cxi V /V / 

where tj = tj + {j — l/2)ANpt {j = 1,2,..., A'p), with A^pt {tp — ti)/Np. This 

should be realized with an algorithm of space-complexity 0{NpL^^) and time-complexity 

0{Np{L^OY). 

Now, for a specihc region Ci in a given slice in (f^, tk+i) (or in [f^, t^_,_J), let us compute the 
transition probabilities (or the multiplication factors), yUp^ [{AL{r, F), ) I (AL(*i/),«+)] (ffBi5(i)), 

with ranging AL{i;I), AL{i;F), and [t^Rk+il ^ [tiRp], by directly approximating the deh¬ 
nition, Eq.31, of the hnite-time transition operator. The hrst thing we need to do is discretize 
the time-interval, [t 7 ,tp], with a number of (usually equal-spaced) points in between ti and 
tp; Here, we choose, tj = tj + j ■ A^pt (j = 1, 2,..., Np — 1), with Apfpt (tp — ti)/Np. 

And we set to tj and t^p ^ tp. Then, we dehne a series of ’’discretized” hnite-time 
transition operators: 

pm drf ^ _ Qo^{r, H)) ■■■(! + Apipt ■ Qo^{r, t,)) (32) 

^^The tj’s defined here are related to the ij's defined above via the relations, tj = {tj-i + tj)/2. 
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for j = 1,2, ...,Np, and Pq^ = 1. To express this definition more precisely, we 

can nse the recnrsion relation: 

piD ^ piD ^ _ (0o^(i; tj)) (33) 

for j = 1,2, ...,iVp, and, again, t/, to) = 1- 

Next, we dehne the mnltiplication factors: 

d?' [(s(C.;0).*j) I (i>(C,;0),0)| =' (s(C.i«,) I I i>(C,;0)> , (34) 

for j = 0,1,..., Np. Then, by snbstitnting Eq.33 (for j = 1,..., Np) into the right-hand side 
of the above eqnation, and nsing X]s(Ci q-i)eS(Ci) I 1= i() where S(Ci) 

denotes the space of all possible states in the region Q), we get: 

= [(^'’Vi) I {s{Ci]ti),ti)] X 

s'e5(cp L 

X (^(5(s',s(Ci;tj)) + Apfpt- {s' \ \ siCptj))^ . (35) 

Here, we set s' = s{Ci]tj-i) for simplicity. This gives the recnrsion relations for the ’’directly 
approximated” mnltiplication factors at time-points, ti,t 2 , tp), and the relations 

can be compnted by iteration. As argned above, in the locally space-homogeneons model we 
deal with, the state within each region (Cj) is determined by the nnmber of sites, AL{i;t), 
in between the PASs. Thns, nnder the same setting as for Eq.56, the recnrsion relations, 
Eq.35, become: 

I (AL/,t/)] {apEii)) 

= (^1 — Apfpt ■ AR^x {^Lj, tj)^ ■ [{ALj, tj-i) I {ALj, tj)] (cTp£;(i)) 

min(L^®, ALj — 1) 

+ANpt ■ I {ALj — I — 1 + (JBE{i)) X gi{liij) X 

i=i 

[{^Lj - I {ALi,ti)] {(JBE{i)) I 
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min(LgO, L^°-ALj) 

+ {ALj + 1 ) '^2 I 9D{l,ij) X 

i=i 


[i^^j + l,tj-i) I (AL/,t/)] (aBE(i)) I 


with the initial condition: 


(36) 


[(ALo,to) I (AL7,t/)] {aBE{i)) — S{ALo, ALj) . 

(It should be noted that tj = to-) Here, for clarity, we used the short-hand notations, 
ALj = AL{r, I) and ALj = AL{i;j). Here, we also used the fact that the dependence 
of the factors on the region Ci is only through (JBE{i) (=^ o'_b(^) + o'e(*))- When t/ is 
hxed, the recursion relations, Eq.36, with ranging ALj, ALj, and tj has the space- and 
time-complexities of 0{Np{L^^}‘^) and 0{Np{L^^}^), respectively. If the model is time- 
homogeneous as well, this will be sufficient, because depend on tj and tj only 

through tj — tj. If the model is not time-homogeneous, however, we also need to compute 
the Eq.36 with ranging tp If such computations are performed serially, the total time- 
complexity becomes 0{{Np}‘^{L^^}^). 

Alternatively, we could extend the time-interval backward, from [tF,tF]. See appendix B 
if you are interested in the specihc expression of the resulting recursion relation. 

To further raise the level of approximation, we need to switch on QM-.B-eriit) in Eq.21, 
because this term also preserves the ” A/D”-coloring pattern of the sub-sequence. 

Although we could, at least theoretically, directly approximate the dehnition of the hnite- 
time transition operator: 

4^+B-er(b t') = T I exp 

such computations will get practically impossible as the number of colored regions increases 
^^The above recursion relation is valid for ALj = 1, , and AL/ = 1, ALj = 0 needs to 

be excluded because QQ^{i;t) does NOT include the deletion of an entire region; and AL/ = 0 is excluded 
because we are here considering the evolution of a colored-region after its creation. 



Qo^ir) + QZ:B-eri 


^) 


( 37 ) 
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(to, e.g., 3 or 4), because we need to keep at least multiplication factors in the 

memory during the computation for the time-slice with Nc colored regions. 

In this regard, a more promising approach would be to iteratively solve the integral equa¬ 
tion in which the ’’base”-operator is Qo^{t) and the ’’perturbation” operator is Qm-.b-sA^)-: 
because the factors, (5 M:B-er(b(with i = — 1), could be dealt with separately. 

See appendix C for more details. 

[NOTE:] In this study, however, we regard the ’’boundary-eroding” deletions in Eq.25 also 
as ’’coloring-pattern-changing events”, and deal with them as ’’perturbations” just as the ”A- 
deleting” deletions in Eq.22 and the ”D”-creating insertions (in the ’’base” rate operator). 
This means that, hereafter, the evolution of each colored-region in each slice is described by 
the ’’base” rate operator, (dehned in Eq.lO), and the resulting ’’base” multiplication 

factor, ppo [(AL(dF),f“^J | (AL(d/), )]. 

4.4 Computing Probability of Change in ” A/D”-Coloring-Pattern 

Broadly speaking, changes in the ”A/D”-coloring patterns can be classihed into two cate¬ 
gories: (a) deletion of an ” A”-region, possibly accompanying the size reduction of either or 
both of the flanking ”D”-regions, as well as their merger; and (b) creation of a new ”D”- 
region, possibly accompanying the split of the ’’parent” ”A”-region into two ”A”-regions. 
Changes in category (a) are caused by the ”A-deleting” rate operator, QM:A-dei(^) Eq.22. 
In contrast, changes in category (b) are caused by the insertion operators on the ” A”-regions 
in Qo^(t) (in Eq.l6). (In addition, in this study, we include (c) ’’boundary-eroding” dele¬ 
tions, caused by Qu-.B-erA) Eq.24, into the ’’perturbation”. ) 

Deferring the summation of the contributions from different indel operators to the next 
subsection, here we focus on the computation of the action of each single indel operator. Each 
rate operator that accommodates the indel operator always takes the form of a distribution, 
which originates from the indel rates that the indel operators are multiplied by. Therefore, 
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the action of a rate operator, e.g., QM:A-dei(^) a ’’moment”, t, always occurs in the form, 
dt Q^-K-de\{^)i where dt is an inhnitesimal time-element. In numerical computation, the 
minimum time-width, denoted here as At (e.g., = {tp — ti)/Np), or its multiple in some 
cases, will play the role of dt. When a pattern-changing indel occurs at (actually, closely 
around) time tk, we pick the minimum time-interval, [ta,ta + At], that encompasses tk- We 
assume that the indel occurs somewhere within the aforementioned time-interval, but that 
we do not know exactly when. Naively, the action of the relevant term in dt QM-.A-deii'^) 
dt Ql^(t)) should be the combination of the following: 

(1) deleting a specihed set of sites, or inserting a specihed number of sites at a specihed 
position, in the sequence state before the indel; and 

(2) multiplying the probability (computed up to there) by At times the relevant indel 
rate. 

When At is sujficiently small, i.e., At 1/| Ai?^'^(AL, t)\, where AL is the length change 
caused by the relevant indel, this naive prescription works well. If, however. At is relatively 
large, e.g., At x \AR^^{AL,t)\ > 0(1), this may not be the case, because other indel events 
can get involved with a non-negligible probability, especially when a long subsequence is 
inserted/deleted. In order to keep the high accuracy of the computation even when At is 
relatively large, one way would be to use a sufficiently small time-interval , say, A't(<^ At), 
only to compute the probabilities of ” A/D”-coloring changes, and to graft them with the 
transition probabilities (for the time-lapse At — A't) under an unchanged ”A/D”-coloring 
pattern. This should work as long as the ” A/D”-coloring changes are relatively rare. 

4.5 Computing Total Contributions from Each Pattern of ”A/D”- 
Coloring Evolution 

In the previous subsection, we focused on the contribution from each single coloring-pattern- 
changing insertion/deletion event. Here, we consider the collective effects of each type of 
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a. “D”-region creation 


b. “A”-region deletion 



Figure 8: Partial coloring-pattern evolution underlying multiplication factors on ”D”- 
region creation and ”A”-region deletion. a. Evolution underlying the ”D”-creation fac¬ 
tor, MD-cr[(^r, [t, t -I- At]) ; (A^Ljt')], in Eq.38. b. Evolution underlying the ”A”-deletion factor, 
MA-del[(A/E, t') ; {0,—SALfD, [t — At,t])]{aBE{= 0)), in Eq.41. As in previous figures, the red and cyan 

def 

rectangles represent an ”A”-region and a ”D”-region, respectively. Note that, in panel b, SALfjj = 
SAL^fB + SALiifD- See the text for more details on each setting. 


such insertions/deletions . (As in the previous subsections, a locally space-homogeneous 
model is assumed also here. Thus, is the rate of an insertion of length I at time t, 

and g£){l,t) is the rate of the deletion of a length I sub-sequence at time t; these rates are 
independent of the positions of the events as long as they occur in between the PASs, L and 
R.) 

With the current dehnition of the ” A/D”-coloring (given in Section 2), the creation of 
a ”D”-region is relatively easy to deal with, because each ”D”-region is created by a single 
insertion event, and because all sites inserted by the event belong to the ”D”-region at its 
beginning (Figure 8 a). It should also be noted that the insertion event creating the ”D”- 
region belongs to the ’’base” rate-operator, Oo(*a; t)i where r is the time of the creation, and 
Ci^ at r is the ”A”-region from which the ”D”-region was created. Hence, the probability 
(more precisely, the multiplication factor) that a ”D”-region(, referred to as Cj^,) was created 
at a specihc site, say x in at some time in a (small) time-interval, [t, t -|- At], and that 
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the region evolved, without vanishing, to have ApL sites at a later time, t', is calculated as: 


hD-cr [(a:, [t, t + At]) ; {ApL, t')] 

°° r pt+A.t 

= ^ / dr gi{l, t) ■ [{ApL, t’) \ (/, r)] {apE = 2) . (38) 

i=i Lat 

On the left-hand side, we omitted the trivial dependence of the /iD-cr[- ■ ■] on Ci^ and Ci^. 
And the /ipp)- ■ ■] on the right-hand side can be computed as described in subsection 4.3, with 
o'pp (= o'p(tp)) -|- aEiin)) = 2, as explicitly shown as an argument of /iPo[- ■ ■]• 

When numerically computing Eq.38, we apply the prescription given in subsection 4.4, 
and also set the upper-bound of the insertion length. When At is sujficiently small, the 
right-hand side of Eq.38 can be approximated as: 

LfO 

At X t) ■ /ipg [(ApL, t') I (/, t -l- At)] {(jpE = 2) • (39) 

i=i 

And, when At is relatively large, the measure explained in subsection 4.4 is specihcally 
expressed as follows (using A't(<^ 

hD-cr [(a:, [t, t + A't]) ; (ApL, t')] 

tCO 

OO r 

~ ^ A't X ^gi{l, t) ■ fj,p^ [(/', t + At) I (/, t + A't)] {apE = 2) 

Z'=l L 1=1 

x/ipo [(ApL, t') I (/', t + At)] {apE = 2) . 

LfO 

= A't X t) ■ fJ'Po [(ApL, t') I (/, t + A't)] {apE = 2) . (40) 

i=i 

The space-complexity required to store all these factors is 0{Np L^‘^) (except for /ipg [■ ■ -(’s 
that require 0(Ap{L'"‘^}^) memory space), and the time-complexity required to compute 
these equations is 0{Np making the computation feasible. Because the relevant 

”D”-region did not exist before t, the accompanying multiplication factor is 1 (unity) in this 
case. 

Next, we will deal with a relatively difficult case, which is the deletion of an ”A”-region 
(Figure 8 b). As indicated by Eq.23, the deletion of an ”A”-region could, in general, ad- 
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ditionally delete some sites in the flanking ”D”-regions. Therefore, when computing their 
contributions to the probabilities of case-(iv) gapped segments, the number of such addi¬ 
tionally deleted sites must also be recorded. Let SALifo and SAL^fD, respectively, be the 
numbers of such deleted sites in the left-flanking and right-flanking ”D”-regions. In a general 
insertion/deletion model, both SALLfn and SALufc must be recorded, which means that we 
must store 0{Np factors, which may be quite hard on the computer(s). (Note that 

one 0{L^^) comes from the ’’initial” length of the ”A”-region to be deleted.) Fortunately, in 
a locally space-homogeneous model considered here, after factoring out the transition proba¬ 
bilities for the flanking ”D”-regions, the remaining factor depends on the number of deleted 
”D”-sites only through the summation, 6ALfp, = SALpfp, -\- hALpfp, because the deletion 
rate, gp{l,t), depends only on the deletion length, i.e., the total number of sites deleted. 

Thus, the factor we need to store should be the probability (more precisely, the multi¬ 
plication factor) that an ”A”-region(, referred to as Q^,), which had A/L sites at time t', 
was later deleted at some time in a (small) time-interval, [t — At, t], simultaneously with the 
deletion of SALfp sites in the flanking ”D”-region(s). It is expressed as: 


hA-dei [{AjL, t') ; (0, —SALfp), [t — At, t])] {apE) 


E 


1=1 


/t-At 


dr gofl + SALfD,T) ■ ppg [{I,t) \ {AjL,t')] {a be) 


(41) 


Here, again, on both hand sides, we explicitly recorded the dependence of the multiplication 

def 

factors on a be = which could significantly change the results. Especially, it should 

be noted that only SALfp, = 0 is meaningful when aBE = 2, because there are no flanking 
”D”-regions in such a case. 

In the numerical computation, we do as in subsection 4.4 and set the upper-bound, , 
of the deletion length. When At is sufficiently small, the right-hand side of Eq.41 can be 
approximated as: 


L%0-5ALfB 

At X E \9d{1 + SALfD, t) ■ /ipo [(/, t — At) I (A/L, t')] (aBE) 


1=1 


(42) 
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When, At is relatively large, the measure explained in subsection 4.4 is specihcally ex¬ 
pressed (using A't(<^ just as in the ’’creation” case above: 

/^A-dei [(A/L, t') ; (0, —SALfD, [t — A't, t])] (ase) 

L%0-&/^LfD 

~ A't X 5f£)(/-|-(5 ALj£), t) ■/ipg [(/, t — A't) I (A/L, t')] ((Tb£;) . (43) 

i=i 

If nothing else is done to effectively adjust the exponential factor, any of these factors, 
Eqs.41, 42, 43, could be accompanied by the multiplication factor: 

expj-^ dTAR^^{ALafD{t - At),T)^ , (44) 

where ALajof — At) is the total number of sites in the affected flanking ”D”-region(s) at 
time t — At. 

Now, at last, we can compute the probabilities (or their total) of the insertion/deletion 
histories having a given pattern of ” A/D”-coloring evolution. As described in the beginning 
of this section (section 4), we hrst identify the times at which the coloring-pattern changes, 
and slice the coloring pattern history at these times. Then, for each colored region, say 
Ci in each slicegiven by an (open) time-interval, e.g., {tk,tk+i), assign 

hPo [(A^(b^fc+i)>^fc+i) I {^L{i-,tl),tl)] {aBE{i)) 

if it remained existing at both tk and tk+i, 

fJ'D-cr \{x,[tk,tk At]) , (AT(i, t^_l_j^), t^_l_j^)j 

should be noted, however, that the introduction of multiplication factors like this is equivalent to 
assuming that no indel events hit the relevant regions during the time-interval in question ([t — At,t] in 
the above case). Such an assumption should hold well if At • \ AR^^{AL,t)\ <C 1, where AL is the total 
length of the relevant regions. Otherwise, it would be better to ’’pad” the time-interval with the transition 

probabilities under the unchanged ” A/D”-coloring pattern, though it might sometimes be time-consuming. 

^^Remember that we also regard the ’’boundary-eroding” deletions in Eq.25 as belonging to ’’pattern¬ 
changing” events. 
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Figure 9: Colored region extending across slice boundary. When a colored region (in this case, 
Ci{k — 1) or Cj{k)) extends across the boundary (tk) between two slices {k — 1 over {tk-i,tk) and slice k 
over (tk,tk+i)), the multiplication factor for this region can span these slices. All you have to do is match 
the region lengths at the boundary, and sum over the lengths, as shown in this schematic equation. In 
this figure, the notations j{k — 1) and i{k) represent the j th region in slice k — 1 and the i th region in 
slice k, respectively. Although an ” A”-region (red) extends across the boundary in this figure, a ”D”-region 
(cyan) can also extend similarly. (Remember that, for an ”A”-region to extend, its asE needs to remain 
unchanged.) 


if it was created immediately after and 

f^A-dei[{^L{i;t^),t^) ; (0, —5ALy£), — Af, f^+i])] (ube) 

if it was deleted immediately before ffc+i- Then, at each time tk, sum over all possible 
lengths, AL(t;f^)’s and AL(j,t^)’s, of the regions involved (C'j(/c)’s in the slice after 4, and 
— l)’s in the slice before tk). If a colored region extends across the time tk without being 
affected by any perturbations (Figure 9)(, which means that even a be remains unchanged), 
the relevant summation (and the accompanying ’’matching” of the pair of lengths across tk) 
”join”s the multiplication factors before and after tk, giving a multiplication factor dehned 
in the joint interval, say, (tk-i,tk+i). 

This way, we are left only with the summations at the ’’perturbation” events, each re¬ 
garding the regions affected by the event. After computing these summations, we can, at 
least in principle, obtain the total contribution for a given pattern evolution by ’’integrating” 
over the times of the events. In practice, however, especially when multiple events are in- 
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volved, the required multiple-time integration could be extremely time-consuming, if naively 
performed. It should therefore be desirable to, whenever possible, deal with the ’’pertur¬ 
bations” one after another, each time computing the summation and the time-integration 
required for each event, to store the ’’interim” results, and then to re-use them for the ’’next 
step” of the computation. This strategy is somewhat similar to that of the ’’pruning” (aka, 
’’peeling”) algorithm for the computation of the likelihood of a phylogenetic tree under a 
given substitution model, given a sequence data set [59] [60] [61]. In the next section, we will 
apply these procedures to the concrete computations of the probabilities when the number 
of perturbations are relatively small. 


5 Concretely Computing Contributions to Case-(iv) 
Probabilities at Given Perturbation Levels 

In Section 4, we mathematically derived the ’’ingredients” to compute the probabilities of 
case-(iv) gap-patterns. Now, by putting these ’’ingredients” together, we will concretely 
compute the probabilities at relatively low perturbation levels, to illustrate the computation 
procedures explained at the bottom of subsection 4.5. Throughout this section, again, it is 
assumed that the computations are performed under a locally space-homogeneous model. 
Moreover, we assume that we consider that the ancestral sequence existed at the ’’initial 
time”, ti, and the descendant seqnence was sampled (or examined) at the ”£nal time”, tp- 
This means that the time-interval, [f/,tir], gives the entire time-frame for the evolution of 
the sequence we are interested in. And let AjLa be the nnmber of ancestral sites at time 
tj, and let ApLp, be the number of descendant sites at time tp, both exclnding the PASs, L 
and R. Then, what we want is the probability of having descendant sites at tp conditioned 
on totally non-homologous ancestral sites at t/: 

■Pcase-(iv) [{ApLp),tp) I {AiLaRi)] 
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def 


= P[{[L,v\1),---,v\AfLd),R], tp) I t,)] 


pi) ^ «'(i) 

for = 1,..., AjLa\ 

A = 1, ■■■,ApLd 


(45) 


When we compute the probabilities of ancestor-descendant PWAs, however, what we actually 
need is the multiplication factor: 


jdp 

case-(iv) [{^ELD^tp) I (A/La,^/)] - 


def -^case-(iv) [{ApLo^tF) I {AiLaRi)] 


P [([ ], [tiRp]) I {AiLaRi)] 


(46) 


where -P [([ ], [t/, tp]) \ {AjLa, tj)] is the probability that no indel events hit the region during 
[tptp], conditioned on AjLa ancestral sites in between the PASs, L and R, at tj. Then, in 
our perturbation method, the multiplication factor (Eq.46) is expanded as: 




= 

PP case- 


(iv) [{AFLp,tF) I (AjLaRi)] 
(iv) [{ApPp, tp) I {AjLa, ti)] 


+ Pp case-(iv) [{Apln, tp) I {AiLa, t/)] 


+ ■■■(47) 


Here, Ai^case-(iv)['''] hp'iase-(iv) ['''] total summatious of all 2nd-order contribu¬ 

tions and all 3rd-order contributions, respectively, to the multiplication factor, Eq.46. 

It should be noted that, under the locally space-homogeneous model we are considering, 
the identity: 


■Pcase-(iv) [(ApPp, tp) I {AiLa, tj)] X exp l^+ <iri?^^([L, /?], r)^ 

= hP case-(iv) [{Aplp^tF) I {AiLaRi)] X exp drAP^ {AiLa,t'^ (48) 


holds, thanks to the equation: 


rtp 




P[([],[t 7 ,tp]) I (A,PA,t/)] =exp( - / dr R^^{\L,RIt) + AP^{AiLa,t) 


ylD, 


Actually, the identities similar to Eq.48 hold also for the partial contributions to the prob¬ 
abilities of case-(iv) gapped segments, and they will be used frequently in the following 
sub-sections. 

Once we obtain the analytical form of each contribution, which includes some time- 
integrations, the numerical computation of each time-integration could be most simply done 
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with the trapezoidal formula (see, e.g., Press et ah [62]). If you would, however, Sympson’s 
formula could also be applied (again, see, e.g., Press et ah [62]), with some extra care in the 
vicinity of the boundaries when the time-interval is an odd number times the time-element 
(At). In the following, we will only give the analytical forms of the contributions. 


5.1 Second-order Contributions 

To have a case-(iv) gapped segment, at least two pattern-changing events are necessary: a 
’’deletion” of the ancestral sites, and an ’’insertion” of the descendant sites. This means that 
the minimnm perturbation level for case-(iv) is the second order. Varying in the time-order 
of the insertion and the deletion, as well as in their spatial-order, the second-order patterns 
of ” A/D”-coloring evolution are broadly classihed into three (Figure 10): (i) A —)■ ”0” —)■ D 
; (ii) A —)■ AD —)■ D; and (iii) A —)■ DA —)■ D. Let us derive analytical expressions of the 
contributions from these patterns one by one. 


5.1.1 (i) A ^ ”0” ^ D 


Let us assume that the ”A”-region was completely deleted at some time in [fi — 
and that the ”D”-region was created at some time in [^ 2,^2 + dt 2 \ (with ti < t 2 ). Then the 
analytical expression for the contribution of pattern (i) to the probability (and multiplication 
factor) we want is: 

btlMKAFic, tp) I (A,Lx, t,)]xexv{+ dTR'p^([L,R],T) 

= dWv) tp) I (A;Lx, t,)] X exp ([-/ dTAfiy(A;Lx,T) 


f*t p—dt2 


f*tF—dt2 


/ dti / 


dU 


(/^A-dei [(A/L^, ti) ; (0, 0, [ti — dti,ti])] {asE — ‘^)/dti) x 


XhP case-(i) [(0, ^ 2 ) | (0, fi)] X (/^D-cr [(a^, [h, h + dt2]) {ApLf), tp)] / dt2) 


(49) 


^^Here, the 0 (for ” empty-set”) is double-quoted, to remind us that some evanescent sites may actually 
have existed between the deletion of the ” A”-region and the creation of the ”D”-region. 
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(i) A^”)3"^D 


(ii) A^AD^D 


(iii) A^DA^D 


T 

Figure 10: Topologies of 2nd-order A/D-coloring pattern histories that result in case-(iv) 
gapped segments. (Left) pattern (i) (A —>■ ”0” —>■ D); (Center) pattern (ii) (A —>■ AD —> D); (Right) 
pattern (iii) (A —>■ DA ^ D). As in the previous figures, the red and cyan rectangles represent an ”A”- 
region and a ”D”-region, respectively. And the yellow rectangle (in panel a) represents a lump of evanescent 
sites that are not colored either ”A” or ”D”. The transparent red triangle converging to an ”X” represents 
the complete deletion of an ” A”-region; the transparent cyan triangle diverging from an ”X” represents 
the creation of a ”D”-region. The thin-colored downward arrow represents evolution via the ’’base” rate 
operator. 


Here, ftp case-(i) [( 0 ,^ 2 ) I (0,H)] is the multiplication factor for the ”case-(i) gapped segment”, 
which does not have any ancestral or descendant sites in between L and R, at both times ti 
and ^ 2 - (We do not care whether or not any evanescent sites existed during the open interval, 
(^ 1 ,^ 2 )-) This factor could be computed practically exactly as a by-product of the algorithm 
in SM-3 of [1], or via a faster algorithm based on the recursion relation nearly identical 
to Eq.36 with apEii) = 2; it requires only two modihcations: (1) inclusion of ALj = 0 
(actually, the factors with this value are all we need to compute here) and ALj = 0, and 
(2) replacement of the upper-bound of the 1st summation with min(L^‘^, ALj) . It 
should also be noted that the /iA-dei[- ■ ■] and /rD-cr[- ■ ■] above are inhnitesimal probabilities 
of 0{dti) and 0 {dt 2 ), respectively; thus their division by dti and dt 2 , respectively, give 
Alternatively, we could devise another faster algorithm based on the recursion relation nearly identical 
to Eq.62 with apEii) = 2; again, we need only two modifications: (1) inclusion of ALp = 0 (actually, 
the factors with this value are all we need to compute here) and ALj = 0, and (2) replacement of the 
upper-bound of the 2nd summation with min)^^*^, ALj). 
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probability densities. Another point to note is the existence of the dti, dt 2 , etc. in the 
upper- and lower-bounds of the time-integrations; we deliberately employed this extremely 
uneonventional notation, in order to facilitate the transition from the analytical expression 
to the numerical computation; (we will keep using this notation hereafter, too); if, instead, 
you intend to calculate these equations completely analytically, just ignore these inhnitesimal 
time-elements in the integration boundaries. 

As argued at the bottom of subsection 4.5, it would save time to perform the calculation 
associated with individual events one after another. See appendix D for details on such 
a series of calculations. The calculations can be done with the maximum time-complexity 
of 0{{NpY{L^'^Y), and the space-complexity of at most where Np is the 

number of sub-time-intervals and is the upper-bounds of the number of sites included 
in each colored region. 

5.1.2 (ii) A ^ AD ^ D 

As opposed to the pattern (i), let us assume here that the ”D”-region is created at some time 
in [ti,ti + dti], and that the ”A”-region is completely deleted at some time in [t 2 — ^^ 2 ,^ 2 ] 
(with ti + dti < t 2 — dt 2 ). In this pattern, the ’’base” rate operator, Qo^{i = l;t), for the 
”A”-region (Ci) has cbe = 2 for t and cbe = I ior ti < t < t 2 — dt 2 - Meanwhile, the 

operator for the ”D”-region always has app = 2. Let ALi be the length of the ”A”-region 
at ti, SAL 2 be the number of sites in the ”D”-region deleted in conjunction with the entire 
” A”-region (in [^2 — dt 2 , ^ 2 ]), and AL 2 be the length of the ”D”-region iat t 2 . (We also assume 
here that, during [t 2 — dt 2 , ^ 2 ], the ”D”-region only suffered the deletion that erased the entire 
” A”-region.) Figure 11 illustrates the situation. 

Under this setting, we obtain the following expression for the contribution from pattern 

the top of the above equation, we explicitly wrote down the expression including the portion of 
lcase-(iv)[' ’']) because this is the first concrete expression given in this section. Hereafter, however, we will 
omit such expressions. 
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dL 


► 




▼ 





Figure 11: Setting for computing probabilities of 2nd-order pattern (ii) (A ^ AD D), in 

Eq.50. This figure was created from the center panel of Figure 10, by adding further annotations. In each 
transparent-colored downward arrow, which indicates evolution via the ’’base” rate operator, only the value 
of aBE is shown. Note that, in this figure (and in previous figures), dti, etc. are made disproportionately 
large, to clearly show changes via insertions/deletions. 


11 ; 


[(^fLd, tp) I tj)] X exp (^- dTAR^jP{AjLA,T) 

oo oo oo ptp-dti-dt 2 rtp 

ZEE/ dti / (^^2l^/tpo [(AFi,ti) I {AiLAiti)] {cfbe — 2) X 

ALi=l AL2=1 <5AL2=0 •Jti+dt\+dt2 

X (/l-D-cr [(2^ = ALi, [ti, ti -\- citi]) ; (AL 2 + 6AL2, t2 — dt2)] / dti) X 
X exp (- f dT 2 AR^xi.AL 2 + 5 AL 2 , r 2 )^ x 

V Jt2—dt2 J 

X (/iA-del [(ALi, ti) ; (0, — 5 AL 2 , [^2 — dt2, ^ 2 ])] {cTbE = ^)/dt2) X 
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(50) 


X/^Po [(^P-^P,^p) I (^-^2,^2)] BE — 2 ) 

It should be noted here that, normally, the argument, t 2 — dt 2 of the /iD-cr[- ■ ■] should be 
expressed as t 2 , because the result remain unchanged for the integration of a well-behaved 
function of time, and the exponential factor in the 3rd last line should be omitted because it 
usually gives 1 (unity). Here, however, we deliberately chose these expressions, to facilitate 
the translation to the numerical computation. 

In numerical computation, the upper-limits of the above summations must be hnite 
numbers, instead of inhnity. For example, if we consider each region-length to be or 
less, the above triple-summations will be replaced by: 

LCO lCO LPO-AL2 

ZEE- 

ALi=l AL2=1 (5AL2=0 

As usual, it should be time-efficient to perform the computations associated with indi¬ 
vidual events separately, one after another. Such ”pruning-like” computation in this case is 
detailed in appendix E. The series of computations can be performed with the maximum 
time-complexity of 0{{Np}‘^ and the maximum space-complexity of 0{Np{L^^}‘^). 

5.1.3 (iii) A ^ DA ^ D 

Actually, contributions from this pattern is identical to those from pattern (ii), as long as 
the indel rates are symmetric regarding the reversal of the spacial order of sites, which is 
indeed the case with the locally space-homogeneous model we are now dealing with. Thus, 
we can just ’’borrow” the results of the pattern (ii), to compute the contributions from the 
pattern (iii) here. This is most easily implemented by doubling the results in the previous 
sub-subsection. (Even if the model is not spatially symmetric, we can easily modify the 
equations for the pattern (ii), to obtain those for the pattern (hi).) 

already argued in footnote 23, it would be better to replace the exponential factor with appropriate 
transition probabilities under the unchanged ” A/D”-coloring pattern, if the numerical counterpart of dt is 
not sufficiently small. 
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5.1.4 Summary 


By combining all these results obtained in the previous sub-subsections, we can obtain the 
total contributions from all the second-order terms, as follows: 


hp case-(iv) tp) \ {AjLa, tj)] 

= /^ptise-(iv) tp) I (A/La, tl)] 

0.= i, ii, iii 

= /^p tas!-(iv) [{^fLd, tp) I {AiLa, tl)] + 2x /ip tase-(iv) [{Aplp, tp) \ (A/La, t/)] • (51) 


5.2 Third-order Contributions 

The third-order terms are contributed by ”A/D”-coloring evolution patterns with three, he., 
one plus the minimum (=2), pattern-changing events. In addition to the creation of a ”D”- 
region and the complete deletion of an ”A”-region, the additional event could be an insertion 
or a deletion. Besides, the additional event could be a ’’boundary-eroding” deletion included 
in the perturbation, Eq.21. Varying in (1) the kind of the additional event, (2) the time-order 
of the events, and (3) their spatial relationships, the third-order ” A/D”-coloring evolution 
patterns can be broadly classihed into the following six (Figure 12): (a) A —)■ ADA —)■ AD 
^ D; (b) A ^ ADA ^ DA -> D; (c) A DA ^ DAD ^ D; (d) A ^ AD ^ DAD D; 

(e) A —)■ DA DA —D; and (f) A —)■ AD AD —)■ D. In (e) and (f), the ”B-er” indicates 
that a ’’boundary-eroding” deletion occurred then. As you can see, the patterns (b), (d) and 

(f) are the space-reversal of the patterns (a), (c) and (e), respectively. If the indel evolution 
model we consider is symmetric under the space-reversal(, which is indeed the case here), 
(b), (d) and (f) give contributions identical to those of (a), (c) and (e), respectively. (Even 
if otherwise, the analytical expressions of the former’s contributions can be easily derived 
by properly modifying those of the latter’s.) Hence, in the following, we will only describe 
contributions from the patterns, (a), (c), and (e). 
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m B — er 

’ (a)A^ADA->AD^D (c)A^DA-»DAD^D (e)A-»DA ^ DA-^D 

Figure 12: Topologies of 3rd-order ”A/D”-coloring pattern histories that result in case-(iv) 
gapped segments. (Left) pattern (a) (A —^ ADA —^ AD —>■ D); (Center) pattern (c) (A —>■ DA — 

^3“0r 

DAD D); (Right) pattern (e) (A —)• DA —>• DA —D). The notations are basically the same as in 
the 2nd-order ”A/D”-coloring-pattern histories (Figure 10). The transparent black triangle converging to 
an ”X” represents a ’’boundary-eroding” deletion. Patterns (b), (d) and (f) were omitted here; they can be 
obtained via the space-reversal of patterns (a), (c) and (e), respectively. 


5.2.1 (a) A ADA -)■ AD D 

Let us assume that the ”D”-region was created at some time in + dti], that the ”A”- 
region on the right was completely deleted at some time in [t 2 —dt 2 , ^ 2 ], and that the remaining 
”A”-region was completely deleted at some time in [^3 — ^^ 3 ,^ 3 ], with ti + dti < t 2 — dt 2 
and t 2 <t 3 — dts- In this pattern, the ’’base” rate operators, Qo^(i;t)’s, for the ”A”-regions 
have aBE = 2 before the ”D”-creation (i.e., t < ti), and asE = 1 after the ”D”-creation 
(i.e., ti < t{< ts)). As usual, the ”D”-region always has asE = 2. Let and ArLi be 

the sizes of the left- and right-fragments, respectively, of the ” A”-region at ti. Let SAL 2 
be the number of sites in the ”D”-region deleted in conjunction with the entire ”A”-region 
on the right (in [^2 — ^^ 2 ,^ 2 ]), and AL 2 be the size of the ”D”-region immediately after this 
deletion (and also assume that the ”D”-region suffered no other indels during [^2 — ^^ 2 ,^ 2 ])- 

^®The ’’left- and right-fragments” here mean the fragments of the ”A”-region that are on the left and 
right, respectively, of the position where the ”D”-region was inserted. (We assume that indels involving the 
position did not occur after ti and before the insertion of the ”D”-region.) 
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Figure 13 : Setting for computing probabilities of 3rd-orlier pattern (a) (A —> ADA ^ AD 

—>■ D), in Eq.52. This figure was created from the left panel of Figure 12, by adding further annotations. 
Notes similar to those on Figure 11 apply also here. 


And let 6AL3 be the number of sites in the ”D”-region deleted in conjunction with the entire 
remaining ” A”-region (in [^3 — ^^3,^3]), and AL3 be the site of the ”D”-region immediately 
after this deletion (and also assume that the ”D” region suffered no other indels during 
[^3 — ^^3,^3]. Figure 13 illustrates the setting. 

Under this setting, the contributions from the pattern (a) can be analytically expressed 
as: 

Fpt2-(iv) [{^fLd, tp) I (A/La, tj)] X exp dTAR{^{AjLA,T)^ 
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E E E E E E / 

^]^L/-i=l A/^Zyi=l AZ/2 —1 (5AZ/2=0 AZy3 = l (5AZ/3=0 

/^Po [(^-^1 = ^L-^1 + ^l) I {^iLa, ti)] {(Jbe = 2 ) X 


f*tp—dt^ 


dto 


dU 


• ti-\-dti-\-dt2 


X (/iD-cr [(a: = A^Li, [ti,ti + dti]) ; (AL2 + 6AL2,t2 - dt2)] /dti) x 
X exp f- [ dT 2 AR^J^{AL 2 + 6 AL 2 ,T 2 ) \ x 

\ Jt2—dt2 J 

X (/iA-del [(A/jLi, ti) ; (0, —(5AL2, [^2 — <^^2, ^ 2 ])] = l)/(it2) X 

x/ipo [(AL3 + 5AL3, ts — dts) I (AL2, ^2)] {o'be = 2) 

X exp ( - f drs AR^^{AL^ + 5AL-i,T^)\ x 

X (pA-del [(AlLi, ti) ; {0, —5AL3,[t3 — dtsRs])] {aBE:L = ^)/dts) X 


ft2+dts 


xppo [{ApLbRe) I (ALsjts)] ((Ts£; — 2 ) 


( 52 ) 


(Here, the same notes as in the previous subsection apply to the notations and factors 
with analytically no effects, which were introduced merely to facilitate the translation to the 
numerical computation.) 

This expression involves summations over six lengths and integrations over three time- 
intervals. Thus, if naively performed, it could take too long to hnish within a reason¬ 
able amount of time. Therefore, following the general strategy already proposed, we will 
perform computations associated with the individual events, one after another. See ap¬ 
pendix F for details. The series of computations can be performed with the maximum time- 
complexity of max [ 0 {{Np}‘^, 0 {Np{L^'^}‘^) ] and the maximum space-complexity 
of 0(Ap{L^O}2). 


5.2.2 (c) A -> DA ^ DAD ^ D 

Let us assume that the ”D”-regions on the left and on the right were created at some times 
in [ti,ti + dti] and [^2, ^2 + <^^2], respectively, and that the ”A”-region was completely deleted 
at some time in [^3 — dt^, t^], with ti + dti < h and ^2 + dt2 < ts — dt^,. Here, the ’’base” rate 
operators, ( 5 o^(i;t)’s, for the ”A”-region has apE = 2 for t < ti, apE = 1 for ti < t < ^2, 
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Figure 14 : Setting for computing probabilities of 3rd-order pattern (c) (A —DA DAD — 

D), in Eq.53. This figure was created from the center panel of Figure 12, by adding further annotations. 
Notes similar to those on Figure 11 apply also here. 


and asE = 0 for t2 < t < t^. Let AiL^ and ^2La be the sizes of the ”A”-region at ti, t2, 
respectively; let be the size of the ”D”-region on the left at let be the size 

of the ”D”-region on the right at And let SA^Lrr and SA^Lrr be the numbers of sites in 
the ”D”-regions on the left and right, respectively, that were deleted in conjunction with the 
complete deletion of the ”A”-region (in [t^ — dts]). (And we also assume that the ”D”-regions 
suffered no other indels during [ts — ^^3,^3].) We also dehne A^Lr, A^Lrr + A^Lrr and 

def 

SA^Lr = 6A3LRL + SA^Lrr. Figure 14 illustrates the setting. 

Under this setting, the contributions from the pattern (c) can be analytically expressed 
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as: 


KApic.iF) I (A;La,*/)| X exp drAfiJf (A/Lx, t) 


=E E E E E E , 

^l^A = l A2La = 1 A3L£)L=1 ^3LdR = ^ S^3Lol=0 <5A3Lj3pj=0 

[(^i-^A,^i) I (A/L^jt/)] ((Tbs = 2) X 


tl 

f^tp—dti —dt2—dt^ 


f*tp — dt 2 —dt^ 


rtp 


dti 


dU 


dU 


' ti-\-dti 


' t 2 -\-dt 2 -\-dtz 


^f^Po [(A2-^A,^2) I (AiE^,tl)] {aBE — 1) X 


X (/iD-cr [(a^i = 1 , [^1, tl + dti]) ; {AsLbl + SA^Leil, ts — (its)] / dti) x 

X (/iD-cr [{.X2 = A2Lb,L + ^2La, [t2, t2 + dt 2 ]) ] {A^LbR + SA^LbR, ts — dt^)] / dt 2 ) X 

X exp 1^- [ (ir3[A/t;^^(A3LBL + dA^LoL, a) + dr + 5AsLb)R,T3)] \ X 

\ Jt^—dt^ / 

X (pA-dei [(A2EA,t2) ; ( 0 , —SA^Lr = —{SA^Ldl + SA^Ldr), [ts — dtsRs])] {(Pbe = t))/dts) x 

x/ipo [{ApLd-, tE) I (A^Ld = A^Ldl + AsLbp, ts)] ((Tbb = 2) . 


(53) 


As in the previous cases, we attempt to save time and memory by performing com¬ 
putations associated with the individual events, one after another. See appendix G for 
details. The series of computations can be performed with the maximum time-complexity of 
0{{Np }‘^and the maximum space-complexity of 0{Np{L'"^}^), which could be too 
large for a single personal computer. Fortunately, the computations can be easily cast into 
parallel or distributed computing, which reduces the maximum time- and space-complexities 
to 0{{Np }‘^and 0{Np{L^^}‘^), respectively. 


5.2.3 (e) A ^ DA ^-4"" DA ^ D 

Now, we assume that the ”D”-region was created at some time in -|- dti], that the 

’’boundary-eroding” deletion occurred at some time in [t 2 — dt 2 ,t 2 ], and that the ”A”-region 
was completely deleted at some time in [ts — dts,ts], with ti + dti EG — dt2 and t2 < 
ts ~ dts- Here, the ’’base” rate operators, Qq'^’s, for the ”A”-region has ctbe = 2 for t < ti, 
and (jpE = 1 for t > ti. Let AiL^ and A 2 LA be the sizes of the ”A”-region at ti and 
t 2 , respectively, and let SA 2 LA be the number of sites in the ”A”-region deleted by the 
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Figure 15 : Setting for computing probabilities of 3rd-order pattern (e) (A DA DA 

D), in Eq.54. This figure was created from the right panel of Figure 12, by adding further annotations. 
Notes similar to those on Figure 11 apply also here. 


’’boundary-eroding” deletion. (And we also assume that the ”A”-region suffered no other 
indels during [t2 — ^^2,^2]-) Let /S.2L]:, and A^Lo be the sizes of the ”D”-region at ^2 and ts, 
respectively, and let 6A2LD and SA^Ljj be the numbers of sites in the ”D”-region deleted 
by the ’’boundary-eroding” deletion (in [^2 — ^^2,^2]) and sites deleted in conjunction with 
the complete deletion of the ” A”-region (in [^3 — ^^3,^3]), respectively. (And we also assume 
that the ”D”-region suffered no other indels during [^2 — dt2, ^2] and [ts — dt^, ^3].) Figure 15 
illustrates the setting. 

Under this setting, the contributions from the pattern (e) can be analytically expressed 
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as: 


/^ptlsl(iv) [{^fLd, tp) I (A/La, ti)] X exp dr AR^;^ {A I La, t) 


oo oo oo oo oo oo oo „tp-dti-dt2-dt3 

"EZEZ EEE/ 

AiL^=l A2L^=1 A2Lj:}= 1 5A2LJi^=l SA2L£)=1 A^L£)=1 SA^L£)=0 

LPo [AiLA,ti) I {AjLA,ti)] {aBE(iA) = 2) X 

x/^Po [A 2 LA + ^^2La, h — dt2) I (AiLa, ti)] {(Tbei = 1) x 

X (/iD-cr [{x = 1, + dti\) ; (A 2 LD + 5A2LD,t2 - dt 2 )] /dti) X 

ri 2 


ntp—dt^ 


dt 2 


dU 


‘ ti-\-dti-\-dt 2 


ft 2 +dts 


X exp < — 


dT 2 


AR^^{A2La + 5 A 2 LA, T 2 ) + AR^^{A2Ld + <5A2Ld, r2) 


ID, 


Jt 2 -dt 2 

X5'd(<^A2La + dA 2 L 0 ,^ 2 ) X 

X (/iA-del [(A2i^A, ^ 2 ) ; (0, —SA^Lp, [t^ — dt^, ts])] {(JbE2 = ^)/dt^) X 
x/ipo [(AaLp + SA^Lp, — dt^) \ (A 2 LD, ^ 2 )] {.(Xbe(d) = 2) x 
X exp I - f drs AR^xA^Lp + SA3Lp,T3)\ x 

L Jt^—dt^ J 

xppo [(ApLp,tp) I (A3Lp,t3)] {<Jbe{d) = 2 ) 


X 


(54) 


As we can see, this computation could be harder than the computations in the patterns 
(a) and (c), because it involves summations over seven lengths, more than the six lengths 
for the patterns (a) and (c)!! As in the previous cases, we will follow the general strategy, 
and perform the computations serially. See appendix H for details. The largest compu¬ 
tation step in this series of computations can be performed with the time-complexity of 
0({A’p}^{L'"‘^}^) and the space-complexity of 0(iVp{L'"'^}^), which could be too large for 
a single personal computer. Fortunately, the computation can be easily cast into parallel or 
distributed computing, and be ’’decomposed” into smaller-sized computations, 

each of which has the time- and space-complexities of and 

respectively. 
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5.2.4 Summary 


By combining all these results obtained in the previous sub-subsections, we can obtain the 
total contributions from all the third-order terms, as follows: 


/^p'iase-(iv) tp) \ {^iLa, f/)] 

= tp) I (A/La, tl)] 

0.= a, b, c, d, e, f 

= 2x I ^ tp) I {^iLa, t/)]| • (55) 

'^a= a, c, e 


5.3 Notes on probabilities of gapped segments on boundaries 

Provided that the evolution model at hand dictates how to handle indel rates on the boundary, 
it is not so hard to derive the formulas for the probabilities of gapped segments on the se¬ 
quence boundary: in each of the formulas provided in this section, just replace one of the 
indel rates in the bulk with the indel rates on the boundary, when the entity in question is on 
the boundary. One thing that must be kept in mind is that the probabilities of horizontally 
symmetrical patterns {e.g., the 2nd-order patterns (ii) and (hi)) are not equal any longer. 
Therefore, the summations given in sub-subsections 5.1.4 and 5.2.4 cannot be reduced to their 
hnal forms, and you must compute the contributions from every single pattern diligently. 
This can double the computational time compared to that for the bulk probabilities. 

In the actual sequence study, boundary indel rates can vary greatly depending on a num¬ 
ber of factors , including how the aligned sequences were prepared, and whether any biological 
important sites or regions are on (or near) the boundaries. Therefore, without any clear ideas 
on these factors , it may be a good idea to exclude gapped segments on the boundaries from 
your data analyses. 
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6 Implementing and Validating the New ’’Perturba¬ 
tion” Method 

6.1 Implementation 

The new ’’perturbation” method proposed here was implemented into a package of proto¬ 
type Perl scripts and Perl packages, named ”LASTPIECE(_P),” which abbreviates ’’Local 
Alignment-STate Probability that Insertion-type and dEletion-type gaps Co-Exist(, Perl- 
version).” The package can do the following: 

1. it computes the multiplication factors of the probabilities of case-(i), (ii), (hi) and (iv) 
gapped segments of ancestor-descendant PWAs under a (locally) space-homogeneous 
genuine stochastic sequence evolution model; 

2. currently, it explicitly incorporates the evolution model used by Dawg [48] with either 
power-law or geometric indel length distributions, although it can easily be modified 
to accept, any genuine evolution model (as long as it is (locally) space-homogeneous) 
and/or any indel length distributions; 

3. Its main master script, ’lastpiece.alpha.pl’, collectively computes the multiplication 
factors of the gap-conhgurations with 1, 2, ..., Nub ancestral sites and/or with 1, 2, 
..., Nub descendant sites, where Nub is a user-specified upper-bound, for a range of 
time-lapses, from near zero to a user-specihed maximum value; 

4. by default, it computes the multiplication factors of the case-(iv) gapped segments up 
to (and including) the 3rd-order perturbation level; 

5. it does not only output the hnal results, i.e., the ’’practically exact” multiplication 
factors of case-(i), (ii) and (iii) gapped segments, as well as the factors of case-(iv) 
gapped segments up to 3rd-order, but it does also output the total contributions from 
the 2nd-order patterns (i) & (ii), and those from the 3rd-order patterns (a), (c) & (e); 
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6. the package also has some supplementary scripts that enable simple analyses on the 
computed multiplication factors, such as collectively calculating theoretically predicted 
frequencies of gap-conhgurations, collectively calculating the ratios, each of a case-(iv) 
multiplication factor to the corresponding product of case-(ii) and case-(iii) factors, 
tallying the frequencies of gap-conhgurations in a set of simulated ancestor-descendant 
PWAs, etc. 

7. unfortunately, the current version computes only the probabilities in the bulk, however, 
it is possible to implement the computation of probabilities on the boundaries (see 
subsection 5.3); this is left as a future task. 

It is worth noting that, once the main outputs of the main master script (’’lastpiece.alpha.pl”) 
are obtained, they can be fed again and again into algorithms to compute probabilities of 
ancestor-descendant PWAs or MSAs(, as long as the computational parameters match); 
such algorithms are integral parts of other program packages of ours, namely, LOLIPOG [1], 
ComplLiMment [30], and ANEX [56]. 

This package (LASEPIECE(_P)) is available as an open-source package at the FTP repos¬ 
itory of the ANEX project in Bioinformatics.org (https://www.bioinformatics.org/ftp/pub/anex/). 
(Currently, the package runs on the Terminal of Mac OS X; it should run also on some other 
UNIX platforms, including Linux, although we have not yet conhrmed that it does.) 

In the current version (ver. 0.3), the main master script (’’lastpiece.alpha.pl”) performs all 
the computational steps serially, it thus uses only a single CPU (or core) and is considerably 
slow. As explained in appendixes E through H, however, most of the computational steps 
This strategy of pre-computing and re-using the multiplication factors has already been suggested in 
Additional File 1 of [1], and has actually been implemented since version 0.6 of LOLIPOG(_P) [1] and 
version 0.6 of ComplLiMment(_P) [30], both of which were released in 2015. We learned that a recent 
’’simulation-based” approach to statistical PWA [53] also employs this strategy of pre-computing and re¬ 
using the probabilities of gapped segments (more precisely, ’’chop-zones” in their study). We do not know 
whether they have just borrowed our idea or independently come up with this strategy by themselves. 
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can be easily cast into parallel or distributed computing; we expect that, once this is done, 
and if the main scripts are also translated into C, for example, the computation could be 
more than 1000 times faster (if distributed to about 100 CPUs). 

6.2 Validation via in silica experiments 

In order to validate the method presented here, we simulated the evolution of 100,000 DNA 
sequences, each of which was 10,000 bases long initially, down each of the time-intervals 
of 0.2, 0.5, and 1.0 (in the unit of the expected number of substitutions per site); then, we 
counted how frequently each conhguration of the gapped segments occur in the bulk (he., not 
on either end) of the resulting ancestor-descendant PWAs; and hnally, compared such ’’ob¬ 
served” frequencies of the gap-conhgurations with their theoretically expected frequencies 
computed from the multiplication factors obtained by running ’’lastpiece.alpha.pl” described 
in the previous subsection. (Appendix I explains how we computed the theoretically expected 
frequencies from the multiplication factors.) 

The simulation parameters used are: insertion rate = deletion rate = 0.1 (events/site/unit¬ 
time); upper-bound of the insertion length = upper-bound of the deletion length = 100; 
insertion- and deletion-length distributions are both power-law with the exponent of -1.6 
(he., {frequency} oc where L is the indel length). 

We matched these parameters with the corresponding parameters for ’’lastpiece.alpha.pl”. 
The remaining important parameters for ’’lastpiece.alpha.pl” are set as follows: upper-bound 
of the number of initial or hnal sites for the multiplication factors to be output = 100; 
upper-bound of the number of sites as the arguments of multiplication factors (such as 

should be kept in mind that the ’’observed” frequency of each configuration is a stochastic variable, 

which in general approximately follows a Poisson distribution. Thus, when dealing with an ’’observed” 

frequency, don’t forget that it is actually fluctuating, with its standard error well-approximated by its square 

root. (For example, when an ’’observed” frequency is 100, its standard error is about 10 (= vTOO).) 
^^Parameters for substitutions are irrelevant, because we are interested exclusively in insertions/deletions 

here. 
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/iPfi [(AL 2 ,t 2 ) I (A/Li,ti)] {(Jbe)) to be computed = 150 size of the sub-time-interval (cor¬ 
responding to dt) = 0.01. On our Mac Pro (Late 2013) desktop computer (with OS version 
10.11.6, with one 3.5 GHz 6-core Intel Xeon E5 Processor and 16 GB physical memory), 
using only a single core, it took 307 hours and 51 minutes, or 12.83 days, to hnish this 
computation. 

The full results of the numerical computation are freely available as an archive hie that ac¬ 
companies LASTPIEGE(_P) at the Bioinformatics.org FTP repository (https://www.bioinformatics.org/fti 
Here, we only show the results at some sample points, which we believe are enough to demon¬ 
strate the accuracy of our new perturbation method. 

In a previous study of ours [1], we compared the multiplication factors of case (ii) and 
case (iii) gapped segments at various perturbation levels to the ’’practically exact” solutions; 
in that study, however, the ’’practically exact” solutions themselves were not validated. Here, 
we hrst conduct this validation, by comparing the theoretically expected frequencies of case 
(ii) and case (iii) gap-conhgurations computed from the ’’practically exact” multiplication 
factors to the frequencies ’’observed” in the simulated PWAs. As shown in Table 1, the 
’’practically exact” theoretical frequencies do indeed approximate the actually ’’observed” 
frequencies extremely well ! (Although the table here shows the results for case (ii) gapped 
segments only, the approximations are actually extremely good also for case (hi).) 

Then, we go on to validate our main subject, i.e., the multiplication factors of the case- 
(iv) gapped segments. Tables 2, 3, and 4 show the results with the time-lapses of 0.2, 0.5, and 
1.0, respectively each table compares some observed frequencies with the corresponding 
theoretical expectations by only parsimonious indel histories, by parsimonious plus next-to- 
parsimonious indel histories, by our new perturbation method (2nd-order only), and by our 

^^This value was chosen in order to take some account of the effects of intermediate states with more than 

100 sites, while keeping the computation-time from growing tremendously. 

^^These translate to 0.04, 0.1, and 0.2 indels/site, respectively, under the current setting of {total insertion 

rate} = {total deletion rate} = 0.1 events/unit-time. 
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Table 1: ’’Practically exact” theoretical frequencies compared to ’’observed” frequences: for 
case (ii) gap-configurations 



Time-lapse = 0.2 

Time-lapse 

= 0.5 

Time-lapse 

= 1.0 

Ua * 

Theor.f 

Obs.^ (Ratio^) 

Theor.f 

Obs.^ (Ratio^) 

Theor.l 

Obs.l- (Ratio^) 

1 

7647035 

7639772 (1.001) 

14545190 

14520543 (1.002) 

18462846 

18425261 (1.002) 

2 

2555279 

2556655 (0.999) 

4954761 

4956088 (1.000) 

6476811 

6471697 (1.001) 

5 

592030 

592456 (0.999) 

1154814 

1154915 (1.000) 

1524810 

1523018 (1.001) 

10 

195150 

195096 (1.000) 

380309 

379167 (1.003) 

501694 

500482 (1.002) 

20 

64347 

63781 (1.009) 

125332 

124964 (1.003) 

165252 

164916 (1.002) 

30 

33672 

33714 (0.999) 

65690 

65579 (1.002) 

86816 

86379 (1.005) 

40 

21285 

20923 (1.017) 

41621 

41318 (1.007) 

55183 

54745 (1.008) 

50 

14914 

14798 (1.008) 

29218 

28968 (1.009) 

38844 

38681 (1.004) 

60 

11141 

11046 (1.009) 

21830 

21688 (1.007) 

29054 

28959 (1.003) 

70 

8679 

8631 (1.006) 

16950 

16559 (1.024) 

22528 

22171 (1.016) 

80 

6934 

6867 (1.010) 

13411 

13438 (0.998) 

17755 

17477 (1.016) 

90 

5548 

5442 (1.019) 

10491 

10163 (1.032) 

13866 

13563 (1.022) 

100 

3511 

3375 (1.040) 

6867 

6724 (1.021) 

9763 

9673 (1.009) 


(The total indel rate is 0.2 indels/site/unit-time.) 

* The number of ancestral sites. 

^ The theoretically expected frequency computed from the ’’practically exact” multiplication factor 


(rounded to the nearest whole number), 
t The frequency observed in simulated PWAs. 

The ratio of the theoretically expected frequency to the observed frequency (rounded to the nearest thousandth). 


61 





new perturbation method (2nd-order plus 3rd-order). 

These tables clearly indicate that the new perturbation method provides a dramatic im¬ 
provement in the accuracy, compared to our previous method based only on parsimonious 
(and possibly next-to-parsimonious) contributions; the accuracy improvement gets more re¬ 
markable as the numbers of sites increase. In fact, the theoretical prediction by the new 
perturbation method is amazingly accurate: even with only the 2nd-order terms, the predic¬ 
tion is more than a half of the observation up to (and including) 75 ancestral (=descendant) 
sites (when time-lapse = 0.5); if we incorporate the 3rd-order terms, this is the case even up 
to (and including) 95 sites and even with time-lapse = 1.0!! 

Still, accounting for just a little over ^ of the observation may not be satisfactory for 
some people or some sorts of applications. As far as we can think of, there are at least 
three potential causes of these underestimations: (1) insufficiently hne partition of the time- 
interval, (2) insufficient incorporation of the effects of gapped segments containing more sites 
than the upper-bound for the output, and (3) lack of the 4th- or higher order terms. They 
are detailed in the following. 

(1) In our in silico experiment, we used the sub-time interval of 0.01 (= At). This may 

^^The frequencies compared here are ’’double-cumulative” frequencies. The double-cumulative frequency 
at {x,y) is defined by the summation of the frequencies over ua = x, and nu = Here, 

ua is the number of ancestral sites in the gapped segment, nu is the number of descendant sites in the 
gapped segment; and are the upper-bounds of ua and no, respectively. In this validation study, 
= njj'® = 100. These double-cumulative frequencies were used here because each particular case-(iv) 
configuration occurred only less than once in our simulated PWAs if both ancestral sites and descendant 
sites are many (~100). 

^®These tables (ie., tables 2, 3, and 4) may give you an impression that the theoretically expected 
frequencies substantially underestimate the observed frequencies at small values of had- Such ostensible 
underestimates actually resulted mostly from the underestimated frequencies of larger gap-conhgurations. 
(Remember that the tables show double-cumulative frequencies.) Indeed, when comparing the expected raw 
frequencies of individual gap-configurations with the observed ones, the expected frequencies {via 2nd -I- 3rd 
order) are nearly 100% of the observed ones at had = 1, and about 95% at had = 10. 
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not have been sufficiently fine, especially when considering a large number of sites. For 
example, when AL = 100 and giir) = goiT) = 0.1, the increment of the exact rate is 
AR^^{AL, t) = ( 5 f/(r)+ 5 f£)(r)) X AL = 0.2x100 = 20. Thus, At x Ai?;^(AL, r) = 0.2. 
This means that the region having 100 sites could suffer an indel in nearly 20% of the cases 
even during the smallest time-interval. At; this in turn means that a considerable fraction 
of events (or histories) could have been ignored in our numerical computation. This effect of 
” coarse-graining” is also indicated by the somewhat unexpectedly poor accuracy for the cases 
with the time-interval = 0.2, which is approximated by only 20 sub-intervals. At present, the 
only remedy for this problem is to use a smaller sub-time interval. Since the computational 
steps are at most 0({Ap}^), where Np is the number of partitions (of the largest time- 
interval, which is 1.0 in this study), the computational time is expected to become 4 times 
as long if At is halved, and 16 times as long if At is quartered. 

(2) In numerical computations, we always need to set an upper-bound (say, 100) of the 
number of sites that the subject region can contain. On the other hand, the actual evolution 
of sequences should not care about such an artificial upper-bound; in other words, there 
could be evolutionary histories in which the subject region contained more sites than the 
upper-bound once(, or twice, ...) in the middle of the evolutionary course, and in which 
the number of its sites finally got within the upper-bound. The contributions from such 
evolutionary histories are destined to be ignored, and the effects of such ignored histories are 
expected to be bigger as the number of (initial or final) sites approaches the upper-bound. 
One way to alleviate the effect of these ignored histories is to set two kinds o/upper-bounds, 
one for the purpose of computation and the other for the purpose of the output. (The former 
is usually larger than the latter.) Actually, we took this measure: setting the upper-bound 
of 150 for computation and 100 for the output (shown in the tables in this paper). We 
also perform the computation using the upper-bound of 100 for both computation and the 
output {data not shown)] the comparison of both results clearly indicated that this ’’enlarged 
upper-bound for computation” —em does work, sometimes enhancing the prediction up to 
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2- or 3-fold (especially when the time-lapse is large). However, it remains to be seen whether 
this upper-bound of 150 for computation was enough or not. For example, increasing the 
upper-bound to 200 may dramatically improve the accuracy further. The problem is that 
the site-number dependence of the computational time could be where is 

the upper-bound of the site number for computation. Thus, using 200 instead of 150 could 
more than triple the computational time. 

(3) The current version of LASTPIECE(_P) incorporates up to (and including) the 3rd- 
order contributions. It is quite natural to expect that the accuracy will improve further 
if the 4th or higher-order contributions are also incorporated. We expect that the 4th- 
order terms will substantially increase the accuracy, because the 4th-order terms includes 
the evolutionary histories in which both insertions and deletions occur in the middle of 
the region. Insertions/deletions (indels) occurring in the middle of the region can occur at 
multiple alternative positions, whereas indels occurring at either end of the have only two 
options, at most. Thus, the number of possible histories could substantially increase if the 
indels occur in the middle. This is the reason why the 4th-order terms are expected to 
improve the accuracy substantially. We are not sure, however, how much the 5th or higher 
order terms will improve the accuracy. 

Of the above three potential causes, (1) and (2) should be rectihed relatively easily: as 
long as you have enough time, computer memory and storage, just changing the parameters 
and muring the current version of LASTPIECE_P would solve (or ease) the problem. Still, 
this consumes a tremendous amount of computational time, and thus translating into C 
and/or introducing parallel (or distributed) computing will greatly enhance the utility of the 
method. We recommend addressing problem (3) only after problems (1) and (2) are taken 
care of. It is possible that, as long as is around 100, solving problems (1) and (2) should 
provide enough accuracy, and that solving (3) should gain only a limited improvement. It 
is certain, however, incorporating higher-order terms should improve the accuracy if 
increases further. 
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Table 2: Comparing ’’observed” frequences * with various theoretical expectations: for case 
(iv) gap-conhgurations with time-lapse = 0.2 (or 0.04 indels/site) 



Observed^ 

Theoretically expected 1 


Uad * 


parsimonious 

parsimonious 

-b next-to- 

parsimonious 

new- 

pert urbation'’, 

2nd order 

new- 

pert urbation^ 

2nd -b 3rd order 

1 

839485 

584428 (0.70)* 

735068 (0.88) 

759127 (0.90) 

782203 (0.93) 

2 

389944 

206152 (0.53) 

304891 (0.78) 

338223 (0.87) 

357031 (0.92) 

5 

164745 

56895 (0.35) 

104936 (0.64) 

135579 (0.82) 

147156 (0.89) 

10 

79684 

18007 (0.23) 

40373 (0.51) 

63333 (0.79) 

70104 (0.88) 

25 

21078 

2024 (0.096) 

6213 (0.29) 

15684 (0.74) 

17759 (0.84) 

50 

3219 

119 (0.037) 

475 (0.15) 

2328 (0.72) 

2656 (0.83) 

75 

285 

5.8 (0.020) 

25 (0.088) 

206 (0.72) 

234 (0.82) 

90 

21 

0.40 (0.019) 

1.5 (0.071) 

15 (0.71) 

17 (0.81) 

95 

6 

0.088 (0.015) 

0.29 (0.048) 

2.8 (0.47) 

3.2 (0.53) 


* The frequencies compared here are ’’diagonal”, ’’double-cumulative” frequencies, which at uad is defined here 


as the summation over ^{ancestral sites} = uad^ 100 and #{descendant sites} = had, 100. 
1 The frequency observed in simulated PWAs. 

1 The theoretically expected frequency computed from the ’’practically exact” multiplication factor. 

(Theoretically expected frequencies are written to two significant figures if they are less than 10; 
otherwise, they are rounded to the nearest whole number.) 

The new perturbation method proposed in this paper. 

* The parenthesized number in each cell of the 3rd, .., or 6th column is the ratio 

of the theoretically expected frequency to the observed frequency (written to two significant figures). 


65 





Table 3: Comparing ’’observed” frequences * with various theoretical expectations: for case 
(iv) gap-conhgurations with time-lapse = 0.5 (or 0.1 indels/site) 



Observed^ 

Theoretically expected ^ 


nAD * 


parsimonious 

parsimonious 

-|- next-to- 

parsimonious 

new- 

pert urbation'’, 

2nd order 

new- 

pert urbation^ 

2nd -|- 3rd order 

1 

4115710 

2114384 (0.51)* 

2932548 (0.71) 

3667248 (0.89) 

3940229 (0.96) 

2 

1983127 

594118 (0.30) 

1053356 (0.53) 

1633121 (0.82) 

1854776 (0.94) 

5 

862274 

106693 (0.12) 

259875 (0.30) 

641992 (0.74) 

776779 (0.90) 

10 

430149 

20742 (0.048) 

66275 (0.15) 

296020 (0.69) 

374083 (0.87) 

25 

118736 

893 (0.0075) 

4109 (0.035) 

72998 (0.61) 

96877 (0.82) 

50 

19752 

29 (0.0015) 

136 (0.0069) 

11109 (0.56) 

15063 (0.76) 

75 

2041 

2.1 (0.0010) 

7.7 (0.0038) 

1068 (0.52) 

1456 (0.71) 

90 

214 

0.22 (0.0010) 

0.74 (0.0035) 

93 (0.43) 

130 (0.61) 

95 

53 

0.057 (0.0010) 

0.18 (0.0034) 

21 (0.40) 

30 (0.57) 


♦ . t. t. b. ♦ same notes as in Table 2 apply. 
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Table 4: Comparing ’’observed” frequences* with various theoretical expectations: for case 
(iv) gap-conhgurations with time-lapse = 1.0 (or 0.2 indels/site) 



Observed! 

Theoretically expected ! 


nAD * 


parsimonious 

parsimonious 

-|- next-to- 

parsimonious 

new- 

pert urbation'’, 

2nd order 

new- 

pert urbation^, 

2nd -|- 3rd order 

1 

10964544 

4031861 (0.37)* 

6088386 (0.56) 

9172625 (0.84) 

10448969 (0.95) 

2 

5596227 

879561 (0.16) 

1850172 (0.33) 

4085212 (0.73) 

5114311 (0.91) 

5 

2548574 

95576 (0.038) 

294546 (0.12) 

1557902 (0.61) 

2170348 (0.85) 

10 

1322899 

10819 (0.0082) 

43926 (0.033) 

703358 (0.53) 

1051858 (0.80) 

25 

393051 

288 (0.00073) 

1249 (0.0032) 

172182 (0.44) 

278141 (0.71) 

50 

72832 

16 (0.00022) 

58 (0.00080) 

27321 (0.38) 

45839 (0.63) 

75 

9131 

1.3 (0.00014) 

4.8 (0.00053) 

2969 (0.33) 

5077 (0.56) 

90 

1045 

0.15 (0.00014) 

0.50 (0.00048) 

311 (0.30) 

543 (0.52) 

95 

262 

0.037 (0.00014) 

0.12 (0.00046) 

77 (0.29) 

136 (0.52) 


♦ . t. t. b. ♦ tJie same notes as in Table 2 apply. 
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6.3 Significance of genuine sequence evolution model 


Since we confirmed that the new perturbation method enables us to compute the multiplica¬ 
tion factors of case-(iv) gapped segments quite accurately already at the 3rd-order level, we 
can use now these multiplication factors to compare the case-(iv) multiplication factors to the 
product of the corresponding case-(ii) and case-(iii) multiplication factors at the same finite 
time-lapse. Previously [1], this comparison was made only in the limit where the time-lapse 
approaches 0 (zero). 

Table 5 shows the results, in terms of the ratio of a raw case-(iv) factor, which was 
computed up to (and including) the 3rd order, to the product of practically exact case-(ii) 
and -(hi) factors. (The table shows only the ratios of some ’’diagonal” conhgurations, with 
T^^ancestral residues = ^(^descendant residues = uad ) It is noteworthy that the ratio vary 
greatly, from around 1.5 at uad = 1 to over 15 at had = 50; the ratio does not seem to 
depend so much on the time-lapse at least up to uad = 50. We consider that the results 
for Had > 50 are more or less due to artifacts caused by the hnite upper-bound of the 
indel lengths (, which is 100 here); especially with a small time-lapse, the ratio is mostly 
determined by the overlapping indels in the 2nd-order patterns (ii) and (iii), and the number 
of such indels decreases as had increases. 

It should be noted that, in the simple generalized HMM of Kim and Sinha [58], this ratio 
should be 1 (unity), independently of the numbers of ancestral and descendant residues. 
Standard HMMs {e.g., [37]) are also expected to give more or less similar results. Therefore, 
this Table 5 demonstrates how important it is to use genuine sequence evolution models when 
accurately computing (and comparing) the probabilities of ancestor-descendant PWAs, which 
in turn will provide building blocks in the computation of MSA probabilities {e.g., [3, 30], 
which borrowed the concept of phylogenetic MSA construction from [63, 64]). 

Some might argue that generalized HMMs {e.g., [65]) should provide enough flexibility 
to incorporate such variations in the relative probabilities of gapped segments (see, e.g., [66]; 



but also see [67] to prevent your view from being biased). Although their claim may be true 
in a sense, you must remember that generalized HMMs should suffer from the ’’agony of 
the rich”; that is, generalized HMMs should in general have a compromised predictive power 
because it can accommodate much more degrees of freedom than necessary. It should be 
noted that the variation in the ratio shown in Table 5 is nothing other than a consequence of 
the evolutionary principle, and thus it is a theoretical prediction that emerged naturally from 
the nearly accurate computation under the genuine sequence evolution model; to obtain it 
with a genuine evolution model, you don’t need to artificially adjust any parameters. The 
above consideration definitely argues for the necessity of using genuine sequence evolution 
models for an accurate computation of alignment probabilities. 

Of course, you could use some generalized HMMs with the parameters adjusted so that 
they will reproduce the evolutionary features as shown in Table 5. In this case, however, 
you already depended on the genuine sequence evolution model, and thus its indispensability 
will never be compromised at all. Besides, there may be other yet undiscovered features 
of the genuine sequence evolution model that cannot be reproduced by the aforementioned 
parameter-adjusted generalized HMMs. 


7 Discussions 

As indicated by previous studies {e.g., [10, 29, 30]), reconstructing a sequence alignment 
is a crucial yet highly error-prone step, and one of the major causes of errors is the inher¬ 
ent stochasticity of sequence evolution processes {e.g., [31, 32, 30]). This fact necessitates 
the probability distribution of alignments under the genuine sequence evolution model, and 
makes it all the more crucial to compute the probabilities of sequence alignments as ac- 

^^The exceptions to these arguments are generalized HMMs that have been directly derived from genuine 
sequence evolution models. For example, the ’’long indel” model [2] and the models proposed (and used) in 
our previous works [3, 1, 30] fall into this category. 
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Table 5: Comparing case-(iv) multiplication factors * to product of case-(ii) and case-(iii) 
multiplication factors 


Uad * 

Time-lapse = 0.2 

Time-lapse = 0.5 

Time-lapse = 1.0 

1 

1.58 t 

1.62 

1.62 

2 

1.94 

2.03 

2.09 

5 

3.18 

3.36 

3.51 

10 

5.26 

5.62 

5.88 

25 

10.91 

11.76 

12.23 

50 

16.22 

17.34 

17.87 

75 

13.52 

14.79 

16.51 

90 

7.58 

10.13 

13.96 

95 

5.15 

8.65 

13.44 

100 

3.39 

8.63 

14.39 


(The total indel rate is 0.2 indels/site/unit-time.) 

* The case-(iv) multiplication factors compared here are raw, ’’diagonal”, factors, 
with ^{ancestral sites} = #{descendant sites} = uad- 
i The number in each cell is the ratio of the case-(iv) multiplication factor (2nd + 3rd order) to 
the product of corresponding case-(ii) and case-(iii) multiplication factors. More precisely: 

^P^caidiv) tp) I {riAD, ti)] j case-(ii) [(“) tp) | ijlAD, T)] X /Xp case-(iii) tp) \ (“, t/)]| , 

with tp — ti = 0.2, 0.5, or 1.0. Each ratio is written to the nearest hundredth. 
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curately as possible. Fortunately, the probability of alignments under a genuine sequence 
evolution model is factorable into the product of an overall factor and contributions from 
gapped segments {e.g., [2]), provided that the model satishes a couple of conditions [3]. 
Regarding ancestor-descendant PWAs, the gapped segments were classihed into the four 
categories, namely, case-(i), -(ii), -(hi) and -(iv) segments [1]. In a previous work of ours 
[1], we provided a pair of methods to numerically compute practically exact probabilities 
of case-(i), -(ii) and -(iii) segments, but left the computation of practically or nearly exact 
probabilities of case-(iv) segments unresolved as the ’’last piece of the puzzle”. 

In this study, aiming to resolve this ’’last piece of the puzzle”, we constructed a new 
’’perturbation” method, by reformulating the previous perturbation framework of genuine 
sequence evolution model [3, 1]. The validation analyses indicated that new ’’perturbation” 
method works considerably well, computing the probabilities of case-(iv) segments fairly 
accurately, already at the 3rd-order level (when the 2nd-order is the lowest), even when 
both the numbers of ancestral and descendant residues are near the upper-bound of 100. 
(See Tables 2, 3, and 4.) Now that we can compute fairly accurate probabilities of case- 
(iv) segments, we should be able to compute the probabilities of ancestor-descendant PWAs 
fairly accurately. Moreover, using the technique of stacking up ancestor-descendant PWAs 
along the phylogenetic tree [63, 64, 3, 30], we could also compute the probabilities of MSAs 
much more accurately than in the previous efforts {e.g., [30]). In fact, in view of the result 
that the approximation by parsimonious and next-to-parsimonious indel histories are pretty 
poor when gaps are considerably long (Tables 2, 3, and 4), it may be better to revisit the 
previous analyses on MSA errors [30], using the results of this new perturbation method. 

Meanwhile, the results of our analysis (Table 5) revealed an important feature of genuine 
sequence evolution that can never be reproduced by other, non-evolutionary, probabilistic 
models of sequence alignments, including standard HMMs, reconhrming our previous claim 
[1] that using genuine sequence evolution models is indispensable to the computation of ac¬ 
curate probability distribution of sequence alignments, hence to the truthful ’’reconstruction” 
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of sequence alignments. 

Each term of the case-(iv) probability in our new perturbation method (in section 5) 
involves multiple time-integrations and multiple summations over intermediate states (in 
terms of the number of residues), and its naive numerical computation could take practically 
forever even with current state-of-the art desktop computers Fortunately, such multiple 
time-integrations combined with multiple summations could be unraveled into a series of 
much lighter [i.e., less complex) computations, just as in the pruning algorithm for computing 
the likelihood of a tree given an MSA {e.g., [59, 60, 61]). 

We implemented the computational method up to (and including) the 3rd-order into a 
package of Perl scripts and modules, named ”LASTPIECE(_P),” which is short for ’’Local 
Alignment-STate Probability that Insertion-type and dEletion-type gaps Co-Exist(, Perl- 
version).” Although the current version (ver. 0.3) is fairly slow, it could potentially be 
speeded up 1,000-fold or more, if it is translated into a more time-efficient language, such as 
C, and if a few parts of particularly heavy computations are cast into parallel (or distributed) 
computing. And you must not forget that, once the program hnishes running, the obtained 
results can be reused over and over again, and fed into other programs to compute fairly 
accurate probabilities of PWAs or MSAs, including our new program package, ”ANEX_P” 
(’’Alignment Neighborhood EXplorer(, Perl-version)”), that approximately computes the 
probability distribution of alternative MSAs under some genuine sequence evolution model 
with realistic indels [56]. 

We expect that extending the computation to the 4th- or 5th-order will further improve 
the accuracy, at some expense of computational time. Although you may be able to imple¬ 
ment the 4th-order computations in an ad-hoc manner, as we did in the 2nd- and 3rd-orders, 

^®The situation may change once quantum computation technologies become feasible and widely available 
in the future. 

^®This strategy of pre-computing and re-using the multiplication factors is very similar to the strategy 
employed by a recent ’’simulation-based” approach to statistical PWA [53]. We are sure, however, that we 
have devised the strategy by ourselves. See footnote 30 for more details. 
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it would be a greatly painstaking job. And we expect that an ad-hoc implementation of 
the 5th-order computations would be almost impossible, unless a large number of excellent 
experts are involved. Thus, the key to implement such higher-order computations should 
be to devise a way to unravel the higher-order computation automatically, like the pruning 
algorithm for computing the tree likelihood under a given residue conhguration of an MSA 
{e.g., [59, 60, 61]). It should be remembered that the computation of 2nd- and 3rd-order 
terms were performed by merging the building blocks one after another in a bottom-up man¬ 
ner, and there seem to be some patterns regarding how the blocks are merged, although we 
are not sure whether the patterns have already been exhausted or not. In any case, imple¬ 
menting individual merger processes in these patterns and unraveling each term into a series 
of these merger patterns should be the key to the automated computation of higher-order 
terms. 

The success of our new perturbation method implies that a similar method may work 
also for the probabilities of MSAs. We believe that the key to the success of this new per¬ 
turbation method was its smart classihcation of indel events into ’’base” and ’’perturbation” 
indels. Specihcally, ’’base” indels are indels not changing the coloring pattern, which can 
occur relatively frequently because of their rich positional degrees of freedom; ’’perturbation” 
indels, on the other hand, are indels changing the coloring pattern, which occur relatively 
rarely because of their positional constraints. When dealing with MSAs, (the topology of) 
the ancestral states, i.e., sequence states at internal nodes, may be the analog of the coloring 
pattern for an ancestor-descendant PWA. Indeed, indel histories that keep the ancestral-state 
topology includes all the indels histories that keep the ancestor-descendant PWAs along indi¬ 
vidual branches. And, using the results of this study and a previous study of ours [1], we are 
now capable of computing nearly exact probabilities of ancestor-descendant PWAs. There¬ 
fore, by reformulating the probability of an MSA, from the summation over all indel histories 
that creates the gap-conhguration of the MSA to the summation over all sets of ancestral 
states that are consistent with the MSA (exactly as the reformulation from Eqs.(R7.4&2) to 
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Eqs.(R7.5&6) of [3]), and by sorting the sets of ancestral states in order of increasing com¬ 
plexity, we may construct a ’’perturbation method” to compute MSA probabilities. And, as 
the results of this paper suggest, even the contributions from parsimonious (and maybe plus 
next-to-parsimonious) sets of ancestral states alone may provide fairly accurate probabil¬ 
ities of MSAs (more precisely, probabilities of gapped segments in MSAs). For example, we 
previously estimated the frequencies of gapped segments in the MSAs of three sequences 
using only parsiminous indel histories, and found a small but non-negligible subset of (prob¬ 
ably long) segments whose frequencies were substantially underestimated (Figure 6 b of [1]). 
Using the method suggested in this paragraph could dramatically reduce such underestimates 
of (long-)segment frequencies. (Of course, it should be examined whether these expectations 
are indeed the case or not ,e.g., via in silico experiments.) 

It should be noted here that the method presented in this paper is applicable to any 
natural continuous-time Markov models of insertions/deletions, with any (yet biologically 
meaningful) total rates and any length distributions of insertions and deletions as long as 
the rates and distributions are uniform at least within each gapped segment (, and as long 
as the entire sequence evolution model satishes the factorability conditions [3]). This means 
that, in principle, the method may be applied also to indel length distributions other than 
the power-law distributions {e.g., [39, 40, 41, 42, 43, 44, 45]), which in a sense represent 
’’average” behaviors, to incorporate more ’’genome-specific” or ’’region-specihc” behaviors. 
For example, insertions of transposable elements e.g., [68, 69]) may be incorporated; the 
simplest way to do this would be to merely add a delta-function-like spike in the insertion 
length distribution; a more refined way may be to allow the insertions at inter-site positions 
whose flanking sites show some specific ’’motif’s, or their approximations with runs of an- 

■^^These should not be confused with parsimonious (and maybe plus next-to-parsimonious) indel histories , 
because the former can result in indel histories with practically unlimited numbers of indels, albeit following 
strict constraints at all internal nodes. 

^^The current version of LASTPIECE(_P) has not implemented this feature yet. 
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cestry indexes (assuming that the ’’motif’s are nearly conserved during the time-interval in 
question). Another example would be the incorporation of the evolution of tandem repeat 
arrays (like micro-satellites and mini-satellites) {e.g., [70, 71, 72]). It may be harder to incor¬ 
porate it, because tandem arrays show complex evolutionary patterns {e.g., [70]). However, 
provided that their evolution can be well-approximated with continuous-time Markov models, 
and provided that the model allows the decomposition into the ’’base” and ’’perturbation” 
parts (as in subsection 4.2 of this study), the method in this paper should be applicable 
(maybe with some modihcations). 

Incidentally, the creation and deletion of ”D(escendant)” and ” A(ncestral)” segments in 
this new ’’perturbation” method (as seen in section 2) may be reminiscent of the ’’birth- 
death” processes of single sites in the TKF91 model [35]. In this sense, this new ’’perturba¬ 
tion” method may be considered as an ’’extension” of TKFQl’s ’’birth-death” approach [35] 
to realistic models of insertion/deletion, although we are not sure yet as to how rigorously 
this ’’extension” makes sense. In any case, delving further into the theoretical side of this new 
’’perturbation” method may lead to some (analytical or numerical) methods that are much 
faster and/or much more accurate than the numerical computational method presented in 
this paper. 

To summarize, the new perturbation method constructed in this study enable us to com¬ 
pute case-(iv) probabilities fairly accurately, and thus providing the ’’last piece of the puz¬ 
zle” of accurately computing the probabilities of ancestor-descendant PWAs under genuine 
sequence evolution models (with any natural length distributions of insertions/deletions). 
Combined with our new program package, ”ANEX_P,” to approximately compute the prob¬ 
ability distribution of alternative MSAs [56], this study (and its main product, ”LAST- 
PIECE_P”) will open up the possibility of the truthful reconstruction of sequence alignments, 
and thus for more accurate evolutionary analyses of homologous biological sequences. 
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7.1 Final Note 


Some of our comments in this paper or other papers may sound like harsh criticisms on 
other researchers or their works. We strongly urge the readers to understand that such 
comments are our candid expressions of our sincere and pure hope for the advance of the 
science in the right direction, and that we have no intension to attack, harm, or hurt anybody 
or anybody’s works. It should be kept in mind that we, all hard-working researchers in the 
world, are not enemies to each other but actually comrades to each other, who are hghting 
against the common enemies, i.e., insufficient understanding of the Mother Nature and the 
lack of tools potent enough to uncover the essence of natural phenomena, as well as being 
complacent of the status quo like that. We truly hope for the future where we, all researchers, 
go hand-in-hand with each other to improve our understanding of the Mother Nature, by 
bringing together ones’ own strengths under the common cause instead 0 /competing against 
each other or even sabotaging each other’s studies , and by sharing all information with 
each otherinstead 0 /keeping crucial information to oneself. Then, our understanding of the 
Mother Nature should surely improve much faster than we’ve ever experienced. (If, however, 
there are, by any chance, corrupt researchers who are indulging in the complacency and/or 
who attempt to deform the scientihc truths to their own interests, we will resolutely fight 
against them.) 
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Appendixes 


A Iteratively Solving Integral Equations for ’’Basic” 
Multiplication Factors, Eq.30 


In subsection 4.2, we provided a very fast method to compute the ’’basic” multiplication 
factors (Eq.30) in a practically exact manner, by directly approximating the dehnition of the 
hnite-time evolution operator (Eq.31). Here, we describe another approach to the compu¬ 
tation, based on the iterative solution of the integral equations, like the practically exact 
computation of case-(i), (ii), and (iii) multiplication factors that we derived previously [1]. 
Although this approach is somewhat slower than the definition-based approach, it has an 
advantage that the upper-bound of the number of insertions/deletions (indels) incorporated 
can be controlled totally independently of the accuracy of numerical computation of the 
time integrations. Therefore, the approach provided here should be more suitable than the 
dehnition-based approach when examining, e.g.^ how the accuracy of the computed multipli¬ 
cation factors depends on the maximum number of indels considered, as we did previously 
regarding case-(ii) and (iii) multiplication factors [1]. 

First, by sandwiching the integral equation,Eq.(R4.5) in a previous paper of ours [3] 
(rewritten for t^+i)), with {AL{r, I) \ and | AL{i]F)), and by explicitly recording 

the dependence on aBE{i) ('= o'b(*) + o'Eii)), we get: 




= 5{AL{i; I), AL{i; F)) exp 


^k+1 


dtAWj^{AL{i-I),t) 


Another possible merit would be that the method provided here involves time integrations, which could be 
done via a more accurate algorithm, whereas the definition-based method involves multiple-multiplications to 
approximate an infinite-multiplications; (I do not know any methods that substantially improve the accuracy 
of the infinite-multiplications over multiple-multiplications.) 
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l9o 


(AL(i;/) -l + aBEii))Y^ 


‘'fc + 1 


dt 


exp 




i=i "fe 


xgi{l,t) X [(AL(i;F),4+J | (AL(i;/) +/, t)] 

mm{AL{rJ) — l,L^^) 


E 

1=1 


{AL{i; I) - I + 1) 


'"fe+i 


dt 


exp < — / drAR^^ (AL(i;J),r) 


X 


xgD(i, t) X [(AL(s; F), i^^j) | {AL{i; I) - I, i)] ((Tbb(!)) 


(56) 


Here, 5{AL, AL'I) is Kronecker’s delta, which equals 1 if AL = AL'I, and 0 otherwise; g[{l, t) 
is the rate of an insertion of length / between a given pair of contiguous sites at time t, and 
gnilR) is the rate of the deletion of a specihc sub-sequence of length I at time t. Here, we 
also introduced the cut-off lengths, and , for insertions and deletions, respectively. 

As in section SM-3 of a previous study of ours [1], the above integral equation system 
can be solved by iteration. The starting point is the ’’zero-event approximation” : 


d'pl [(^-^f,4+i) I [asE) 


= 5{ALt, ALf) exp 


‘'fc+1 


drAR^xiALp r) 


(57) 


Here and below, we use the ’’short-hand” notations, ALf for AL{i-, F) ALt for AL{i] t), and 
apE for aBEii), purely for clarity. And let [{ALfR^) \ {ALt,t)] be the approximation 
at the ns th step, which includes all the responsible indel histories (in the ’’base” rate 
operator) with ns or less indels. Then, by solving the following recursion relation, we can 
improve the approximation one by one, to the accuracy level we desire (provided that we 
have enough time to do so): 


[{^LF,tf,_^_t) I {ALt,t)] {(Tbe) 


'"fc+i 


6{ALt, ALf) exp — J drARx i^Lt^r) 


LfO 


{ALt — 1 + (Pbe) 


‘'fc+l 


dt' 


1=1 


exp <( — / dTAR^x{ALt,T 


X 


xgi{l,t') X [{ALfRi^p-^) I {ALt + l,t')] {aBp) 
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min( ALt — ) 

5^ (AL,-/ + 1) 


‘'fc+1 


dt' 


exp 


drAR^x {^Lt,T) y X 


1=1 


^goilR') X li'p^ [{ALpRj^j^^) I (ALt - lR')\ (ctbe) 


(58) 


As in section SM-3 of a previous work of ours [1], the above recursion relation could be 
rewritten as: 


[(ALf,4_^i) I (ALt,f)] (aBE) 

rtZ 


fc+i 


= (5(ALt, ALi?) exp <( — y drARx (ALt^r) 


+ 


Lk+i 

r ( 

/ dt' 

exp < 

ft 

^ 1 


r>i' 


exp i — I dTAR{P{ALt,T) x 


{aBE) 

with the ’’auxiliary function”, 

[{^LfR^pi) ; (ALt,L)] (ube) 

LCO 

= {ALt — 1 + aBE) ^ ^gi{f-, t') X [{ALf, tfc+i) I (ALt + I, i')] {o'be) 


(59) 


z=i 


iain{ALt — l,L^^) 

+ X] ~ ^ X [{ALfR^+i) I - ^>^0] {(^be) ■ (60) 

i=i 

As an additional note, although we have formally dealt with AR{p{AL,t') thus far, it is 
actually a linear function: 


AR^x^{AL,t') = x AL 

LCO 

with Lg°, t'] = gi{l, t') + Y 9d{U t') 


(61) 


Lgo 


1=1 1=1 

under the locally space-homogeneous model. Knowing this fact will considerably speed up 
the computation. 

For a hxed ALf, the above system of recursion equations, Eq.59 and Eq.60 with rang¬ 
ing ALt and t (and t'), could be numerically solved via an algorithm of time-complexity 
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0{Nj]:)L^^{L'"'^ + Np)Np) and space-complexity 0{L^'^Np), just as in SM-3 of [1]. Here, 
is the upper-bound of ALt (and ALp), Np is the number of sub-intervals into which the 
time-interval (or rather [tijtp]) is partitioned, and Njp is the maximum number of 

indels that we take account of. Because the recursion equation systems with different ALp^s 
can be solved independently of each other, the computation could easily be parallelized (or 
rather, cast into distributed computing). And each computation should take as much time 
as in the case of isolated gaps (i.e., Eqs.(SM-3.4a,b) in [1]). Therefore, when is consider¬ 
ably large, say, 1000, although it may take too much time to numerically solve this recursion 
equation system with a single CPU (or core), the computation could hnish typically in a few 
hours if you can access, e.g., hundreds of CPUs simultaneously. Moreover, once computed, 
the results could be stored and re-used for various analyses under the same evolution model. 
In this sense, this computation itself could be feasible. 

By substituting the solutions of the recursion equation system, Eq.59 and Eq.60, into 
Eq.29, we can obtain the practically exact ”zero-th approximation” of the transition proba¬ 
bilities within each slice. 


B Backward-Extension to Directly Approximate Defi¬ 
nition of Finite-Time Evolution Operator, Eq.31 

In subsection 4.2, we directly approximated the dehnition of the hnite-time evolution operator 
(Eq.31) forward, i.e., from the initial time (tj) to the hnal time (tp). 

Alternatively, we could extend the time-interval backward, from [tp,tp]. In this case, we 
obtain the following recursion relation: 

[{ALp,tp) I {ALj,tj)] {(TpEii)) 

= (^1 - A^pt ■ AR^x tj+i) j ■ p^pj"^ [{ALp, tp) I {ALj, tj+i)] {apEii)) 
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min(LfO, L^'O-ALj) 

+ANpt ■ {ALj — 1 + (JBE{i)) X I gi{^,tj+l) X 

1=1 

[{ALF,tF) I {ALj + l,tjFl)] {o'BEi'i)) I 

min(L^^, ALj — 1) 

+ I {ALj — / + 1) X gBi{l,ijFi) X 

z=i 

[(^-^F,Af) I (^-^i — ^:tjFl)] {<JBE{i)) I , (62) 

with the ’’initial” condition: 

[(ALjT’jti?) I (ALtvp, tzvp)] {o'BE{i)) = S{ALf, ALnp) ■ 

This recursion relation is valid for ALj = and ALf = 1,..., for reasons 

similar to those given below Eq. 36 (in subsection 4.2). 

C Iteratively Solving Integral Equation for Finite-Time 
Transition Probabilities when ’’Boundary-Eroding” 
Deletions are Switched on 

Here, let us formally attempt to compute the hnite-time transition probabilities when the 
’’boundary-eroding” rate operator, QM-.B-er^f) (in Eq-21), is switched on. 

In Eq.24, QM-.B-eA^) expanded into the summation of operators, Qm-.b-sAA acting 
only on individual region-boundaries. Thus, we will define a series of ’’perturbed” rate 
operators: 

Q'£y) = Q^t). 

=' <?B°er~‘’(<) + 0M:B-.,(*;<) (i ^ 1,I^c(k) - !)■ (63) 

Then, the ’’perturbed” rate operator, A) ~ Qo^A) + QM:B-er(^)) incorporates all 

’’boundary-eroding” deletions. Using the above series of’’perturbed” rate operators, and fol- 
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lowing the same procedure as when obtaining the fundamental integral equation, Eq.(R4.5) in 
[3], we can derive a series of integral equations for t') T |exp dr |- 




dr P, 




{^1 P Qm-.B-b 




with f = 1, ...,Nc{k) — 1. Each of these integral equations could be solved by iteration, as 
we solved the integral equation, Eq.56, above. Because the perturbation term, QM-.B-erik ^ 
in each of the above integral equations acts only on two contiguous colored regions, we need 
to track only changes in the lengths of the two regions; this may signihcantly reduce the 
working space-complexity. At least theoretically, if you will, you could directly solve the 
” one-shot” integral equation: 


t') = Po^(t, t') + 1^ dr r) Qm-.b-sAP ^B-^r(P P , (65) 

where we set PiA^AtP') (= T jexp // dr (^Qo^(r) + QM-.B-evip) })• 

In practice, a possibly big problem is that solving this equation could require a tremendous 
amount of memory, because you need to keep track of changes in the lengths of all colored 
regions. Provided that you have adequate resources to do that, of course, solving Eq.65 alone 
may be preferable to solving Eq.64’s serially. In any case, Eq.64 and Eq.65 are identical if 
the slice in question consists of only two colored regions, and they do not even exist if the 
slice has only one region. 


D Time-Efficient Computation of Contributions from 
2nd-order Pattern (i): A — ” 0 ” —> D 

The contributions from the 2nd-order pattern (i), Eq. 49, were given in subsection 5.1.1. 
As argued at the bottom of subsection 4.5, it would save time to perform the calculation 
associated with individual events one after another. 
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In the case at hand, this could be done in two ways. The hrst method performs the 
integration over {ti + dti <)ti(< t2) hrst, and then integrates over {ti + dti <)t2(< tF — dt2). 
It can be written as: 


/^ptise-(iv) tp) I (A/L^, ti)] X exp ( - I dTAR{f{AiLA,T) 

tF — dt2 


rtp 


dU 


hA ”0” [{AiLaRi) ; (0,0) ; t2] {(Tbe — 2) x 


X (/^D-cr [(a^ = 1, [^2, h + dt2]) ; {AfLd, tp)] / dt2) , 

with the ’’extended” multiplication factor, pA -5> ”0” [■■■], dehned as: 


( 66 ) 


def 


hA ”0” [(A/La,^/) ; (0,0) ; t2\ {cbe — 2) 

rt2 r 


dti 


(hA-dei [(A/La, t/) ; ( 0 , 0 , [fi — dti, ti])] ((Tbe — 2 )/(iti) X 


Jtj+dti ^ 

X/Xp 

case- ( 1 ) 1 (« 2 . 0 ) I (il.O)] 


( 67 ) 


By (numerically) pre-computing Eq .67 with t2 G [tj + dtiRp — dt2] and with AjLa = 
1, 2,..., and storing the results, we can avoid the repeated computation, which could 
occur if we directly compute Eq. 49 . 

The second method performs the integration over {ti <)t2{< tp — dt2) hrst, and then 
integrates over (f/ -|- dti <)ti{< tp — dt2). We have: 


/^ptise-(iv) tp) I (A/La, ti)] X exp dTAR^£{AiLA,r) 


f*tF—dt2 


dti 


(M-dei [(A/La, tj) ; (0, 0, [fi — dti, ^i])] (o'pp — 2 )/dti) x 


J tj+dti 

x/i”0” -s. D [^1 ; a; = 1 ; {ApL^Rp)] 


( 68 ) 


with the dehnition: 


def 


A''”0” -s. D [^1 ; a; — 1 ; {ApLuRp)] 

ptF—dt2 


dU 


lip 

case-(i) l(«2,0)|(i„0)]x 


X (pD-cr [(a^, [^2, t2 + dt2\) {ApLp), tp)] / dt2) 


( 69 ) 
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Similarly to the 1st method, the pre-computation of Eq.69 will avoid the repeated compu¬ 
tation encountered in the naive (numerical) computation of Eq.49. 


E Time-Efficient Computation of Contributions from 
2nd-order Pattern (ii): A —> AD —> D 

The contributions from the 2nd-order pattern (if), Eq. 50, were given in subsection 5.1.2. 
As usual, it should be time-efficient to perform the computations associated with individual 
events separately, one after another. 

In this case, such ’’computational factorization” is done as follows. First, we (analytically) 
define the following ’’complex” multiplication factor (of 0{dti ■ dt 2 ))'. 


def 


ho-cr A-del [ti, ti + dti]) ] (AL 2 , [^2 ” dt2, ^ 2 ])] {<JbE — 1) 

00 

[/iD-cr [(a: = ALi, [tifii + dti]) ; (AL2 6AL2, ^2 - dt2)] x 


5AZ/2=0 
X exp 


f*t2 


dT2 AR^fi^{AL2 + 6AL2,T2) X 


't2—dt2 


X/iA-del [(ALi, fi) ; (0, —(5AL2, [^2 — dt2, ^ 2 ])] (cTbe — 1) 


(70) 


which, after being divided by dti ■dt 2 , gives the (hnite) probability density that a ”D”-region 
was created between x and a; -|- 1(, which is on either end, in any case,) of an ”A”-region 
of size ALi, immediately after ti, and that the ”A”-region (evolved with asE = 1) was 
completely deleted (immediately before t 2 ), leaving the ”D”-region of size AL 2 at time t 2 - 
When dealing with a time-homogeneous model, the collection of these multiplication factors 
with ranging ALi, AL 2 and t 2 — ti can be nnmerically compnted with the time-complexity 
of 0{Np{L^'^}^), because the summation, X^L 2 =o should be of time-complexity 

(in numerical computation). Regarding the space-complexity, storing the inpnts, /iD-cr['■ ■] 
and PA-dei[■ ■ ■], reqnires the memories of 0{NpL^'^) and 0{Np{L^'^}^), respectively, and 
storing the ontpnt, po-cr - 5 > A-dei[- ■ ■], also reqnires the memory of 0{Np{L'"^}'^). Thus, the 
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space-complexity for this definition is 

As Figure 11 indicates, the aforementioned /io-cr A-dei[- ■ ■] covers the ’’amputated” por¬ 
tion of the coloring-pattern evolution, after the initial period of a single ”A” and before the 
final period of a single ”D”. We will express this fact with the equation: 

(Fptase-(iv))^^p [(^^2, ^ 2 ) | (ALi, ti)] X {dti ■ (^ 2 ) 

= hD-cr-5> A-del [(ALi, X, [fi, ti -|- dti]) ; (AL 2 , [t2 “ , ^ 2 ])] (o'PP = 1)• (71) 


Here, the subscript, ”Amp.”, on the left-hand side stands for ’’amputated”. Now, using this 
’’amputated” multiplication factor, Eq.50 is rewritten as: 


F?tase-(iv) tp) \ (A/La, t/)] X exp ( - / dTAR^^{AiLA,T) 


rtp 




ALi=l AL2=1 


ctp—dti—dt2 


f*tp 


dti 


dto 


' t\-\-dt\-\-dt 2 


hPo [(AAi,fi) I (AjLaRi)] {cbe — 2) X 


x(41Aiv))*„p.l(^'E'2. ‘ 2 ) I (Ail, «i)]x 

XhPo [(ApF^), tiT’) I (AL2,t2)] {<fbe = 2) . 


(72) 


This equation tells us that the full multiplication factors for the 2nd-order evolution 
pattern (ii) can be obtained by ’’extending” their ’’amputated” version, both to the initial 
time and to the hnal time, each via ’’convolution” with irpo[- ■ ■]{o 'be = 2). 

Eq.72 can be made further time- and space-efficient by separately performing the compu¬ 
tations associated with (ALi,ti) and (AL 2 ,f 2 )- First, we (analytically) define the following 
’’extended” version of the ’’amputated” factor: 


def 


2nd (ii) 

Fp case-(iv)^^^p ^^^^ _j 
°° pt2-dti-dt2 


[(AL 2 , t 2 ) I {AjLa, tl)] 


E 

ALi=l ■ 


dti 


't, 


Fpo [(ATi,ti) I {AiLaRi)] {ube — 2) x 


X t,) I (AL„ i.)] 


(73) 


Here, the subscript, ”Amp.&Ext.-I”, stands for ’’amputated and extended to the initial time 



{i.e.. ti)” . Using this ’’extended” version, Eq.72 can be rewritten as: 


/^ptise-(iv) [(^fLd, tp) I {AiLa, ti)] X exp ( - / dTAR^£{AiLA,T) 


oo rtp 

a: I 


<M .2 [fip case-(iv) 


Amp.&Ext.-I 


[(AL2, ^2) I {AjLa, ti)] X 


x/ipo [{AfLdRf) I (AL2,t2)] (cr_B£; — 2) . 


The numerical computation of the dehnition, Eq.73, with ranging AjLa, AL 2 , and t 2 — ti, re¬ 
quires the space-complexity of 0{Np{L^^}‘^) and the time-complexity of 0{{Np}‘^. 
And so does the numerical computation of Eq.74 with ranging AjLa, ApLp and tp — tp 
The 0{{Np}‘^{L^'^}^) time-complexity may be relatively time-consuming. Once the in¬ 
put factors become available, however, the computation for different combinations of the 
region-lengths, {AjLa, ApLp), can be done independently, and thus this computation can 
easily be parallelized (or cast into distributed computing). (For example, if these computa¬ 
tions with different AjLa^s are distributed, time-complexity of each computation reduces to 
0{{Np}‘^{L^^}‘^), while its space-complexity remaining the same.) Thus, you may manage 
to numerically compute these contributions from the pattern (ii). 

As in appendix D, the order of the time-integrations may be reversed, integrating over 
(ti + dti + dt 2 <)t 2 {< tp) hrst and then integrating over (U <)ti(< tp — dti — dt 2 ). In this 
case, we (analytically) define the ”amputated-and-extended” factor: 


I (Ail. *i)] 

ftp p / s 

X] / ^^2 (/^?case_(iv)) [(^^ 2 , ^ 2 ) | (ALi, ti)] X 

XhPo [{ApLpfip) I (AL 2 ,t 2 )] {<fbe = 2) . (75) 

Here, the subscript, ”Amp.&Ext.-F”, stands for ’’amputated and extended to the final time 
{i.e., tp)” ■ Using this, Eq.72 can be rewritten as: 

/^ptise-(iv) [{AfLd, tp) I {AjLa, ti)] X exp (^- dTAR^^{AiLA,T'^ 
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°° I'tp—dti—dt 2 


fJ-Po [(^-^1,^1) I {aBE — 2 ) X 


^ / dti 

ALi = l 

/^ptase-(iv)) [(AirLD,tF) | (ALi, ti)] 

^ V Amp.&Ext.-F 


X 


(76) 


The numerical computations resulting from this reverse-order procedure also have the time- 
complexity of and the space-complexity of and can easily 

be ’’parallelized”, e.g., by distributing the computations with different ApLp’s. 


Here, for later use, we give an ’’alias” to ^hp^tase-(iv)) 


■ , as follows: 


Amp.&Ext.-F 

hD-cr- 5 > AD- 5 > D [(ALi, x, [fi, ti-|-dfi]) ; (); (ApLp, tp)] (upp = 1) 

= I (AP, «.)| X dt, . 

This will be used as an ingredient for the 3rd-order pattern (a). 


(77) 


F Time-Efficient Computation of Contributions from 
3rd-order Pattern (a): A —> ADA —> AD —> D 


The contributions from the 3rd-order pattern (a), Eq. 52, were given in subsection 5.2.1. 
These contributions can be computed in a time-efficient manner, as in appendix E. 

First, we observe that, during the time interval, [fi, fa —dfa], if we ignore the left-fragment 
of the ”A”-region, the remaining portion (consisting of the ”D”-region and the right-fragment 
of the ” A”-region) is identical to those dehning the factor, /io-cr - 5 > ad d [■ ■ ■] dehned in Eq.77 
(, which is 0{dti) here,) with some modihcations on the arguments. Thus, we can formally 
perform the time-integration over {ti + dti + dt 2 <)t 2 (< ts — dt^) and the summations over 
6 AL 2 and AL 2 , to get: 


AfSlov) [(ApiDjiF) I (A/La,*/)] X exp (^- dTAR‘£(A,LA,T) 


Z Z E E , 

A^]^L \=1 A/^Ti=l /\L^ = 1 (5AZ/3=0 


ntp—dti —dt 2 —dt^ 




dti 


dU 


ti+dti+dt 2 +dt 3 



f^Po [(^-^1 ~ ^lLi + AjiLi,ti) I (A/L^jt/)] {asE — 2) x 


X (/^D-cr -5> AD -5> D [{^rLi, X — A^Li, [ti, ti + dti]) ; 0 ; (AL 3 + 5AL^, — dt^)] {(Tbe-.r — 1)/ dti) x 


X exp (- [ drs AR^^{ALs + SALs,T 3 ) \ x 
\ Jt3—dt3 / 

X (/iA-del [(AlLi, ti) ; (0, —(5AL 3 , [^3 — (its, ta])] ((TbE:^ = l)/(it3) X 

x/ipo [(ApLpjtp) I (ALsjta)] ((Jpp = 2 ) . 


(78) 


Second, we lump together the factors, po-cr - 5 > ad - 5 > d[- ■ ■] and pA-dei[- ■ ■], as well as the 
exponential factor in between, and perform the summation over 6 AL 3 . This gives a set of 
multiplication factors (of 0{dti ■ dts)) that depend on the lengths ApLi, ApLi and AL 3 , 
and the times, ti and thus requiring the space-complexity of 0 {{Np}'^{L'"^Y) if you 
want to store all of them. Fortunately, when the evolution model is time-homogeneous, the 
factors depend on the times only through ts — ti. Besides, in the computation of Eq.52, all 
we need regarding the dependence on (ApLi, ApLi) is through ALi = ApLiR ApLi. Thus, 
we could reduce the space-complexity to by dehning the following ’’complex” 

factors(, which are of 0 {dti ■ dts)): 


hD-cr -5> A-del -5> A-del [(ATi, •, + dti]) () ! (^-^3; [^3 ~ dt^, ^ 3 ])] {(TbE-.R — <^BE:L — 1) 

ALi —1 00 

= 5 : E [ 

A^Z/i=l (5AL3=0 

hD-cr - 5 > AD - 5 > D [{ArLi,x = ApLi, [ti, ti + dti]) () ; (AL 3 -|- SAL^, fa — dt^)] {(Jbe-.r = 1 ) x 


X exp — 


rh 


drsAR^jPiALs + dALs^n) ) x 


hz-dtz 


XftA-dei [(ApLi — ALi — ApLi,fi) ; (0, —(JALa, [fs — (its, fa])] ((Tpp.p — 1) 


(79) 


These factors for ranging ALi, AL 3 , and fa —fi can be numerically computed with the space- 
complexity of 0 {Np{L^'^}‘^) and the time-complexity of 0 (Ap{L'"‘^}^), and the computation 
is easily cast into parallel or distributed computing. (For example, if the computations with 
different ALi’s (or ALa’s) are distributed, each computation has the time-complexity of 
0 {Np{L^^}^), albeit with an unchanged space-complexity.) 



Now, by referring to Figure 13, we notice that the aforementioned /xo-cr -5> A-dei -5> A-dei[- ■ ■] 
covers the ’’amputated” portion of the coloring-pattern evolution for the 3rd-order class (a). 
Thus, we can define the ’’amputated” version of Atp^case-(iv)['''] 

[(^^3, ts) I (ALi, ti)] X (dh-dts) 

= ho-cr -5> A-del A-del [(ALi, •, [tifii + dti]) ] () ; (AL3, [^3 — dt^, ^3])] {<J bE:R = BE:L = !)• 

(80) 

Then, the rest of the computation procedure for this pattern (a) is prescribed with the 
equations almost identical to Eq.72 and Eqs.73 & 74 (or Eqs.75 & 76). The only differ¬ 
ences are: (1) the superscript, ’’2nd (ii)” should be replaced with ’’3rd (a)”; (2) (AL 2 ,t 2 ) 
should be replaced with and (3) <^^2 should be replaced 

with dti ^'^d, of course, arguments on the space- and time- 

complexities also hold in the same way. 


G Time-Efficient Computation of Contributions from 
3rd-order Pattern (c): A —> DA —> DAD —> D 

The contributions from the 3rd-order pattern (c), Eq. 53, were given in subsection 5.2.2. As 
in appendix G, the equation can be cast into a series of equations to time-efhciently calculate 
these contributions. 

First, we deal with the creation of the ”D”-region on the right (in [^ 2,^2 + dt 2 \) (and a 
part of the complete deletion of the ”A”-region (in [G — dtsfis])), in two steps. In the first 
step, we dehne the following complex factor(, which is 0 {dt 2 ■ dtfij)\ 


hD-cr(R) -5> -A-del [(A2T^, X2, [^2, ^2 + ^^2]) ; {—dA^LBL: ^ 2 ,LdR, [G ~ dt^, G])] (cTpps — 0 ) 


def 


P'D-cr [{x2, [^2, G + dt2]) ; {A-^Ldr + SA^Ldr, h — dts)] X 

(5A3Lj3fl=0 
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X exp 


rta 


drs AR^x {^zLdr + SA^Lj:,ji,T3) ) X 


•tz-dtz 


X/^A-del [(A2i^A,^2) ; (0, —SA^Ld — —{SA 3 L DL + SA^Ldr), [ts — dts, ta])] {asE — 0) 


(81) 


where X 2 (= A 2 LD 1 + A 2 LA) only serves as a reminder of the insertion position of the 
”D”-region. With ranging A 2 LA, SA^Ldl, A^L^r and fs — ti, this computation has the 
time-complexity of 0{Np{L‘"^}‘^) and the space-complexity of 0{Np{L^''^}^). If, however, 
we only memorize the output, e.g., for each hxed A^Lpp, and output the results each time 
when all the factors for the fixed A^Lpp are computed, the space-complexity reduces to 
0{Np{L^^}‘^). (The same holds true for the parallel (or distributed) computing, in which 
case the time-complexity also reduces to 0 {Np{L^‘^}^).) 

In the second step, we dehne its ’’extended” version(, which is 0{dt^))\ 


def 


/^■A -AD -5> -A-del [{AiLa, ti) ] X 2 ] { — dA^Lpi-, A^Lpp, [^3 — dt^, ts])] {(JbE2 — 1, O'BES — 0) 
r-t3—dt2—dt3 


A2 La = 1 


dto 


hPo [(^2-hA,^2) I (^l-hA,^l)] {<^BE2 — 1) X 


X (/^D-cr(R) -5> -A-del [(^2-^^, ^ 2 , [^2, h + C?^2]) ! {—SA^Lpp, A^Lpp, [t^ — dt^, ts])] {apES — 0)/(it2) • (82) 


With ranging AiLa, SA^Lpi, A^L^p, and ts — ti, this dehnition can be computed with 
the space-complexity of 0{Np{L^^}^) and the time-complexity of 0{{Np}‘^{L^'^}‘^). If, 
for example, the computations with different A^Lpps are distributed (or parallelized), 
each computation has the space-complexity of 0 {Np{L^^}‘^) and the time-complexity of 

0({Ap}2{L^O|3)_ 

Using the above dehnition of /t.a -s- -ad -s- -A-deil' ■ ■], the expression we desire, he., Eq.53, 
reduces to: 


[{^fLd, tp) I (A/La, ti)] X exp (^- dTAR^£{AiLA, r) 

^ ^ ^ ^ ptp—dti—dt2 — dt^ ptp - 

= 

AiLa =1 A 3 LDi=l A 3 LDfl=l < 5 A 3 Loi =0 Jt^+dt^+dt2+dti 


d'Po [(^l-^A,ti) I (AiLaRi)] {cbe — 2) X 

X (/iD-cr [(ail = 1, [ti, ti -l- dti]) ; {A^Ldl + dA^LpL, ts — dt^)] / dti) x 
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X exp 




drs Ai?;^(A 3 L£)i + (JAsL/ji, Ta) ) x 


'h-dti 


fd-A - 5 > -AD - 5 > -A-del [(AiLa, ti) ; X 2 ] (—(JAsL^L, A^LdR, [ts — dt^, ts])] {aBE2 — 1, (^BE3 — 0)/dt^^ X 


x/^Po [(A^L^), ti?) I (AsLd — A^Ldl + A^Ldr, ts)] {asE — 2) 


(83) 


Next, we deal fully with the complete deletion of the ”A”-region (in [t^ — dt 3 ,ts\). Al¬ 
though, in the current setting, the results of the deletion depend on A^Lbl and A^L^r, 

def 

what we dually want is the function of the total length, A^L^ = A^Ldl + A^Ldr, of the 
”D”-region after the deletion. Hence, we dedne the following complex factor(, which is 
0{dti ■ dtz))'. 


/^D-cr(L) -5> D-cr(R) -5> A-del [(AiLa,^! — 1, \ti,ti + dti\) ; X 2 {A^Ld, [h — ^^ 3 ,^ 3 ])] {o'bE2 — 1,0'BE3 — 0) 

^3^0-1 „ / pts 


def 


Z E 

A3L_dl = 1 SAsLj:ii^=0 


exp 


dxs AR^ {A^Lbl + dA^LBLiTs) x 


'ts-dti 


x/iD-cr [(a^i = 1, [H, h + dti\) ; (A^Lbl + SA^Lbl, h — dt^)] x 

X/i-A - 5 > -AD - 5 > -A-del [(AiLa, ti) ; X2 

; {—SA^Lbl, ^sLdr = — ^sLdl, [ts — dt^, fa ])] {o'be2 = 1 , o'bes = 0 ) 


(84) 


With ranging AiLa, A^Ld and fa — fi, this computation has the space-complexity of 
0{N and the time-complexity of 0{Np{L^^}'^), which may be quite hard on a 
single computer. However, if computations with, e.g., diderent AiLa’s are distributed (or 
parallelized), the space- and time-complexities of each computation reduce to 0{Np{L^^}‘^) 
and 0{Np{L'^‘^Y), respectively. 

Now, Figure 14 suggests that the aforementioned /iD-cr(L) -s- D-cr(R) -s- A-dei[' ■ ■] covers the 
’’amputated” portion of the coloring-pattern evolution for the 3rd-order class (c). Thus, we 
have the ’’amputated” multiplication factor: 

(Fptlsl-(iv))^^p [(AsTd, fs) I (AiLa, fi)] X (dfi ■ dts) 

^^The entire result of this definition is of size 0{Np{L'^^}^), thus may be stored in the memory of a 
common contemporary computer. 
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— A''D-cr(L) -5> D-cr(R) -5> A-del [(^1-^A, Xi — 1, [ti, ti + dti]) ] 

^2 {^3Ld,[^ 3 ~ dts,'t3])]{(^BE2 = ^, CTbeS = 0) ■ (85) 


As in appendix F the rest of the computation procedure for this pattern (c) is prescribed 
with the equations almost identical to Eq.72 and Eqs.73 & 74 (or Eqs.75 & 76). The 
only differences are: (1) the superscript, ’’2nd (ii)” should be replaced with ’’3rd (c)”; (2) 


(ALi,fi) and (AL 2 ,t 2 ) should be replaced with (AiL^, ti) and {A^Ld, h), respectively; and 


(3) dti dt 2 should be replaced with 


rtF—dti — dt2 — dtz 1, rtp 1, 

Jtj ^ Jti-l-dti-l-dt2-hdt3 


Again, arguments on the space- and time-complexities remain the same. 


H Time-Efficient Computation of Contributions from 
3rd-order Pattern (e): A —> DA ^ DA —> D 

The contributions from the 3rd-order pattern (e), Eq. 54, were given in subsection 5.2.3. 
Here is a series of measures to quickly compute these contributions. 

First, paying attention to the ’’boundary-eroding” deletion, we dehne the following com¬ 
plex factor(, which is 0{dti ■ dt 2 ))'. 

/^D-cr -5> B-er- [{x = 1, [fi, ti + dt\\) ] {A 2 LD, —SA 2 LA, [^2 “ dt2, ^ 2 ])] 

00 

= ^ f^D-cr[{x,[tl,ti + dti]) ] {A 2 L 1 :) + 5A2LD,t2 — dt 2 )] X 

SA2Lo=^ 

xexp-f— [ dT 2 AR{P[A 2 Lb) SA 2 Li:),T 2 )\ X dt 2 gD{dA 2 LASA 2 Li:),t 2 ) ■ (86) 

I Jt2 — dt2 J 

With ranging A 2 TD, SA 2 LA and ^ 2 —H, this computation has the space- and time-complexities 
of 0{Np{L^^y) and 0{Np{L^^}^), respectively. 

Second, paying attention to the ’’boundary-eroding” deletion, again, we dehne the fol- 
lowing(, which is also 0{dti ■ dt 2 ))'. 

h(A -5>) D-cr(L) B-er [(AiL^, X = 1, [ti, ti -f dt\\) ; {A 2 LD, A 2 LA, [^2 — dt2, ^ 2 ])] {(TbEI = 1) 
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def 


1^ /^D-cr -5> B-er- — 1, [^1, + dti]) ; {A 2 LD, —5A2La, [^2 — C?t2, ^ 2 ])] X 

SA2L/i^=1 

x/ipo [{A 2 LA + SA 2 LA, t2 — dt2) I (AiL^, ti)] {asEi = 1) X 

rt2 


X exp 


dro 


ft2—dt2 


ARx {A2LA + 6A2LA, T2) 


(87) 


With ranging AiLa-, A 2 LD and A 2 LA, and t 2 — ti, this computation has the space- and time- 
complexities of 0{Np{L'"^}^) and 0{Np{L^'^}^), respectively. If, however, we distribute 
the computations, e.g., with different AiL^’s, the space- and time-complexities of each 
computation become 0{Np{L‘"^}‘^) and 0{Np{L^'^}^), respectively. 

Third, paying attention to the complete deletion of the ”A”-region (in [t^ — dt^R^]), we 
dehne the following complex factor(, which is 0{dts)): 


hoA -5> A-dei [{A 2 LP, A 2 LA, ^ 2 ) ; {A^Lp, [ta — dts, fa])] (app2 — 1) 

00 

= 1^ hA-dei [{A2LA,t2) ] (0, —SA^Lp, [fa — dfajfs])] {o'be2 = 1) x 

x/ipo [{A^Lp + dA^Lp, fa — dfa) | {A2LP, 12)] (cBEiD) = 2 ) x 
X exp I - f drs AR^^iA^Lp + SA^Lp, ra) 



( 88 ) 


With ranging A 2 LP, A 2 LA, A^Lp and fa — f 2 , this computation also has the space- and 
time-complexities of 0{Np{L^^}^) and 0{Np{L^^}^), respectively. If, however, we dis¬ 
tribute the computations, e.g.^ with different AaTp’s, the space- and time-complexities of 
each computation become 0{Np{L^^Y) and 0{Np{L'"^Y), respectively, as in the previous 
computation. 

Next, we further combine the complex factors, Eq.87 and Eq.88, and define the following 
higher-level complex factor(, which is 0{dti ■ dt^)): 


h(A -5>) D-cr(L) -5> B-er -5> A-del [(AiL^, X — 1, [fi, fi -|- dti\) ; () ; (A^Lp, [fa — dt^, fa])] (cTbEI — 0'bE2 — 1) 

OO OO — 

= E Z / '» A 

A2Ld = 1 A2La = 1 

(h(A -5>) D-cr(L) -5> B-er [{AiLa, X = 1, [fi, fi -|- dR]) ; {A 2 LP, A 2 LA, [f2 “ dt2, f2])] (o'SEl = 1)/ dt2) X 


X/iDA -5> A-del [(A 2 TD, A 2 La-, f2) ; {A^Lp, [fa — dfa, fa])] {o'BE2 — 1) 


(89) 



With ranging AiL^, ^sLd and this compntation has the space- and time-complexities 

of 0 {Np{L^'^}^) and 0 {{Np}‘^{L^^}'^), respectively; this is considerably hard on a single 
compnter. One big problem is the large size of the inpnt data, /i(A -s>) D-cr(L) -5> B-er[' ■ ■] and 
/iDA-5> A-dei[- ■ ■], each of which takes 0{Np{L^'^}^) memory space if yon want to store all 
of its elements. (In contrast, the ontpnt data, /i(A -s.) D-cr(L) B-er -5> A-dei[- ■ ■], take np only 
0{Np{L'"^}^) memory space.) The only way to avoid this problem of large space-complexity 
(nsing a single compnter) is to read in only /i(A _>.) D-cr(L) B-er[' ■ ■] with a particular AiL^ 
and only /tda ^ A-dei[- ■ ■] with a particular A^Lp, compute /i(A D-cr(L) ^ B-er ^ A-dei[- ■ ■] with 
this particular combination, {AiL^, A^Lp), and free the memory space for the input data 
after the end of this particular computation. Then, the required memory space reduces 
to 0 {Np{L^^}‘^), and repeating this procedure for all combinations of {AiLa, A^Lp) will 
give the complete dehnition, Eq.89. When performing parallel (or distributed) computing, as 
well, it should be wise to distribute the computations with different (AiLa, A^LpYs, not just 
those differing in ,e.g., AiL^’s alone, to reduce the space-complexity of each computation 
(also to 0(Ap{L^O}2)); in this case, the time-complexity of each computation also reduces 
to 0 ({Ap} 2 {L^O| 2 )_ 

Now, as Figure 15 suggests, the higher-level complex factor dehned in Eq.89 covers the 
’’amputated” portion of the coloring-pattern evolution for the 3rd-order pattern (e). Thus, 
we define the following ’’amputated” multiplication factor: 

[i^sLp, h) I (AiLa, ti)] X {dti ■ dtfi) 

/^{A —>■) D-cr(L) —>■ B-er —>■ A-del [(AiTa, X 1, [ti, t\ -\- dtl]) , () 

; {A^Lp, [^3 — dts, f3])]((Tppi = app2 = 1) • (90) 

Then, again, the remaining procedure is prescribed with the equations almost identical 
to Eq.72 and Eqs.73 & 74 (or Eqs.75 & 76). The only differences are: (1) the superscript, 
’’2nd (ii)” should be replaced with ’’3rd (e)”; (2) (ALi,ti) and (AL 2 ,f 2 ) should be replaced 
with (AiLa, ti) and (A^Lp, ts), respectively; and (3) dti *^^2 should be 
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replaced with dti '^^3- I'he space- and time-complexities 

of the computations are the same as in ’’2nd (ii)”. 


I Computing Theoretically Expected Frequencies of Gap- 
Configurations 


To validate the new perturbation method presented in this study, we computed theoretically 
expected frequencies of the conhgurations of gapped segments in the bulk (he., not on either 
end ) of ancestor-descendant PWAs, and compared them with their frequencies actually 
’’observed” in the bulk of simulated PWAs. Here we explain how the expected frequencies 
are computed. 

First, the multiplication factor, e.g., Eq. 47 for case-(iv) segments, was transformed into 
the probability that the segment, including both flanking preserved ancestral sites (PASs), 
occurs in the PWAs. This can be done by modifying the identity, Eq. 48, as follows: 


.Pcase-(iv) [{AfLoAf) I {AiLaAi)] 

Up 

case-(iv) [{AfLoAf) I {AiLaAi)] X exp 


ftp 


dr 


't, 


R^^{[L,RIt) + AR^fl^{AiLA,r) 


ID, 


(91) 


We know that, in (locally) space-homogeneous model, AR^;^{AiLa, t) = {gj (r)+^D(r))- 
AjLa holds [48, 3], where giir) (= QiQ'i't)) is the total insertion rate per site at time 

r, and 5 'd(t) ( 9d{1, t)) is the total deletion rate per site at time r. 

The question is: what on earth is the R^jfl{[L, R],t)7 Actually, the answer to this ques¬ 
tion can vary, depending on how we treat the ’’surrounding regions” flanking the PASs, 
L and R, especially, insertions occurring there (discussed to some extent in section R8 of 
[3]). For example, the evolution model used by Dawg [48] made a quite natural choice, 
considering that insertions occur also in these regions at the same rate as in the subject se¬ 
quence, and incorporating such insertions into the evolution of the sequence. In this model. 


96 



J^ID I ^ ^ {(/ - 1) gD(l,r)} + 2 (gj(T) + ^D(r)), where the 

summation in the 2nd term comes from deletions sticking out of the subject sequence [48]. 
Meanwhile, the ’’long indel” model of Miklos et ah [2] made another choice, giving such 
insertions the rates that are ’’mirror-images” of the rates of deletions reaching either end of 
the sequence, in order to keep the model time-reversible. (The expression of i?], r) 

for the ’’long indel” model is omitted here because it is somewhat complex.) 

In this study, we make yet another choice, that is, we don’t care at all whether any 
insertions occur or not in the ’’surrounding regions”. This choice should be appropriate for 
the problem at hand, because insertions in the ’’surrounding regions”, if at all, will never 
erode the gapped segment (including the PASs, L and R). Thus, the i?^([L, R],t) we want 
is: 

lco 

= gii'r)+ -^) gnihr)} + 2 goir) . (92) 

1=1 

(Incidentally, if we time-integrate this equation across a phylogenetic tree, the result becomes 
equivalent to the total summation of the exponentiated factors in Eq.(SM-7.1) of [1], as it 
should be.) 

Substituting these results into Eq. 91, we get: 


X 


Tcase-(iv) [{^fLdRf) I {^iLaRi)] 
fip 

case-(iv) [{AfLdRf) I {AiLaRi)] X 


exp< — 


f*tF 


dr 


'ti 


LCO 


9i{^)+ ^{{^-^) gD{l,r)} + 2 goir) + {gRr) + gnir)) ■ AjLa 


1=1 


(93) 


To compute the expected frequency of each conhguration of the case (iv) gapped-segment, 
which is uniquely specihed by {AjLa, AfLd), we simply multiply the probability, Eq. 93, 
by the total number, Nt, of regions that could potentially become the gapped segment in 
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question. Thus, we have: 


{Theoretically expected frequency of ^fLd)} = -^rX -Pease-(iv) | (A/Lyi,t/)] . 

(94) 

Although we discussed case-(iv) gapped segments as an example, we could also obtain the 
theoretically expected frequencies of the conhgurations of case-(ii) and -(iii) gapped segments 
in totally similar manners. 

In this study, we naively substituted the total number of ancestral sites, Nso x Lso, 
for Nrp. Here Nsq and Lso are the number of ancestral sequences and the length of the 
ancestral sequences, respectively. In this study, Nso = and Lso = Therefore, 

Nt ~ lO’*'® X = 10^®. There are at least two factors to consider in order to determine 
whether this naive measure and Eq. 94 work well. First, the actual number of available 
regions must be the number of sub-sequences embedded in the ancestral sequences. This 
should be Nso x {Lso — ^iLa — 1) for the conhguration, {AjLa, ApL^)- In this study, 
AjLa < 100 10^"^ = L^o- Therefore, the aforementioned actual number should be well 

approximated (albeit slightly overestimated) by Nso x Lso- Second, Eq. 94 exactly holds 
only when the available Np regions are independent of each other, whereas some nearby 
regions actually overlap each other; this poses the problem of auto-correlations. As long 
as the gapped segments are distributed sparsely enough within the sequences, such auto¬ 
correlations should be negligible. This must be indeed the case when the time-interval is 0.2 
or 0.5, although it may not necessarily the case when the time-interval is 1.0. Anyway, in this 
study, we will use the naive equation, Eq. 94, and solving the problem of auto-correlations 
will be left for future studies. 


brief consideration suggests that the auto-correlation here should cause an ’’exclusion effect”; two (or 
more) different gapped segment should never overlap each other, thus the frequency of actual occurrences 
should be somewhat lower than predicted by Eq. 94, provided that the probability is practically exact. 



References 


[1] K Ezawa. General continuous-time Markov model of sequence evolution via inser¬ 
tions/deletions: local alignment probability computation. BMC Bioinformatics., 17:397, 
2016. 

[2] I Miklos, GA Lunter, and I Holmes. A ’’long indel” model for evolutionary sequence 
alignment. Mol Biol EvoL, 21:529-540, 2004. 

[3] K Ezawa. General continuous-time Markov model of sequence evolution via inser¬ 
tions/deletions: are alignment probabilities factorable? BMC Bioinformatics., 17:304, 
2016. Erratum in: BMG Bioinformatics (2016) 17:457. 

[4] D Gusfield. Algorithms on Strings, Trees, and Sequences: Computer Science and Com¬ 
putational Biology. Gambridge University Press, Gambridge, UK, 1997. 

[5] S Kumar and A Filipski. Multiple sequence alignment: in pursuit of homologous DNA 
positions. Cenome Res., 17:127-135, 2007. 

[6] MR Aniba, O Poch, and JD Thompson. Issues in bioinformatics benchmarking: the 
case study of multiple sequence alignment. Nucleic Acids Res., 38:7353-7363, 2010. 

[7] A Loytynoja. Alignment methods: strategies, challenges, benchmarking, and compara¬ 
tive overview. In M Anisimova, editor. Evolutionary Cenomics. Methods in Molecular 
Biology (Methods and Protocols), vol. 855, pages 203-235. Humana Press, Totowa, NJ, 
2012 . 

[8] S lantorno, K Gori, N Goldman, M Gil, and G Dessimoz. Who watches the watch¬ 
men? an appraisal of benchmarks for multiple sequence alignment. In D Russell, edi¬ 
tor, Multiple Sequence Alignment Methods. Methods in Molecular Biology (Methods and 
Protocols), vol. 1079, pages 59-73. Humana Press, Totowa, NJ, 2014. 


99 



[9] T Warnow. Computational Phylogenetics: An introduction to designing methods for 
phytogeny estimation, chapter 9. Cambridge University Press, 2017. 

[10] A Loytynoja and N Goldman. Phylogeny-aware gap placement prevents errors in se¬ 
quence alignment and evolutionary analysis. Science., 320:1632-1635, 2008. 

[11] SB Needleman and CD Wunsch. A general method applicable to the search for similar¬ 
ities in the amino acid sequences of two proteins. J Mol Biol, 48:443-453, 1970. 

[12] TF Smith and MS Waterman. Identihcation of common molecular subsequences. J Mol 
Biol, 147:195-197, 1981. 

[13] DF Feng and RF Doolittle. Progressive sequence alignment as a prerequisite to correct 
phylogenetic trees. J Mol EvoL, 25:351-360, 1987. 

[14] JD Thompson, DC Higgins, and TJ Gibson. CLUSTAL W: improving the sensitivity of 
progressive multiple sequence alignment through sequence weighting, position-specific 
gap penalties and weight matrix choice. Nucleic Acids Res., 22:4673-4680, 1994. 

[15] E Margulies and E Birney. Approaches to comparative sequence analysis: towards a 
functional view of vertebrate genomes. Nat Rev Genet., 9:303-313, 2008. 

[16] JR Lupski. Genomic rearrangements and sporadic disease. Nat Genet., 39(7 Suppl):S43- 
S47, 2007. 

[17] M Lynch. The Origins of Genome Architecture. Sinauer Associates, Sunderland, MA, 
2007. 

[18] W Gu, E Zhang, and JR Lupski. Mechanisms for human genomic rearrangements. 
Pathogenetics., 1:4, 2008. 


100 



[19] TH Lee, J Kim, JS Robertson, and AH Paterson. Plant genome duplication database. 
In A. van Dijk, editor. Plant Genomic Databases. Methods in Molecular Biology, vol. 
1533, pages 267-277. Humana Press, New York, NY, 2017. 

[20] CN Dewey. Whole-genome alignment. In M Anisimova, editor. Evolutionary Genomics. 
Methods in Molecular Biology (Methods and Protocols), vol. 855, pages 237-257. Hu¬ 
mana Press, Totowa, NJ, 2012. 

[21] D Earl, N Nguyen, G Hickey, RS Harris, S Fitzgerald, and K et. al. Beal. Alignathon: 
a competitive assessment of whole-genome alignment methods. Genome Res., 24:2077- 
2089, 2014. 

[22] JD Thompson. Statistics for Bioinformatics: Methods for Multiple Sequence Alignment, 
1st Edition, chapter 6. ISTE Press - Elsevier, 2016. 

[23] T Yada. Genome alignment. In S Ranganathan, K Nakai, and G Schonbach, editors. The 
Encyclopedia of Bioinformatics and Gomputational Biology, pages 268-283. Elsevier, 
Amsterdam, NL, 2019. 

[24] O Gotoh. An improved algorithm for matching biological sequences. J Mol Biol., 
162:705-708, 1982. 

[25] GB Do, MS Mahabhashyam, M Brudno, and S Batzoglou. ProbGons: probabilistic 
consistency-based multiple sequence alignment. Genome Res., 15:330-340, 2005. 

[26] G Notredame. Recent evolutions of multiple sequence alignment algorithms. PLoS 
Gomput Biol, 3:el23, 2007. 

[27] A Wilm, DG Higgins, and G Notredame. R-Goffee: a method for multiple alignment of 
non-coding RNA. Nucleic Acids Res., 36:e52, 2008. 


101 



[28] K Kryukov and N Saitou. MISHIMA-a new method for high speed multiple alignment 
of nucleotide sequences of bacterial genome scale data. BMC Bioinformatics., 11:142, 
2010 . 

[29] G Landan and D Graur. Gharacterization of pairwise and multiple sequence alignment 
errors. Gene., 441:141-147, 2009. 

[30] K Ezawa. Gharacterization of multiple sequence alignment errors using complete- 
likelihood score and position-shift map. BMC Bioinformatics., 17:133, 2016. 

[31] G hunter, A Rocco, N Mimouni, A Heger, A Galdeira, and J Hein. Uncertainty in 
homology inferences: assessing and improving genomic sequence alignment. Genome 
Res., 18:298-309, 2008. 

[32] RA Gartwright. Problems and solutions for estimating indel rates and length distribu¬ 
tions. Mol Biol Evol, 26:473-480, 2009. 

[33] J Hein, G Wiuf, B Knudsen, MB Mpller, and G Wibling. Statistical alignment: com¬ 
putational properties, homology testing and goodness-of-£t. J Mol Biol, 302:265-279, 
2000 . 

[34] MJ Bishop and EA Thompson. Maximum likelihood alignment of DNA sequences. J 
Mol Biol, 190:159-165, 1986. 

[35] JL Thorne, H Kishino, and J Felsenstein. An evolutionary model for maximum likeli¬ 
hood alignment of DNA sequences. J Mol Biol, 33:114-124, 1991. 

[36] JL Thorne, H Kishino, and J Felsenstein. Inching toward reality: An improved likelihood 
model of sequence evolution. J Mol Biol, 34:3-16, 1992. 

[37] B Knudsen and MM Miyamoto. Sequence alignments and pair hidden Markov models 
using evolutionary history. J Mol Biol, 333:453-460, 2003. 


102 



[38] I Miklos and Z Toroczkai. An improved model for statistical alignment. WABI 2001., 
LNCS 2149:1-10, 2001. 

[39] GH Gonnet, MA Gohen, and SA Benner. Exhanstive matching of the entire protein 
seqnence database. Science., 256:1443-1445, 1992. 

[40] SA Benner, MA Gohen, and GH Gonnet. Empirical and strnctnral models for insertions 
and deletions in the divergent evolntion of proteins. J Mol Biol., 229:1065-1082, 1993. 

[41] X Gn and WH Li. The size distribntion of insertions and deletions in hnman and rodent 
psendogenes snggests the logarithmic gap penalty for seqnence alignment. J Mol EvoL, 
40:464-473, 1995. 

[42] Z Zhang and M Gerstein. Patterns of nncleotide snbstitntion, insertion and deletion in 
the hnman genome inferred from psendogenes. Nucleic Acids Res., 31:5338-5348, 2003. 

[43] MS Ghang and SA Benner. Empirical analysis of protein insertions and deletions de¬ 
termining parameters for the correct placement of gaps in protein seqnence alignments. 
J Mol Biol, 341:617-631, 2004. 

[44] K Yamane, K Yano, and T Kawahara. Pattern and rate of indel evolntion inferred 
from whole chloroplast intergenic regions in sngarcane, maize and rice. DNA Res., 
13:197-204, 2006. 

[45] Y Fan, W Wang, G Ma, L Liang, Q Shi, and S Tao. Patterns of insertion and deletion 
in mammalian genomes. Curr Genomics., 8:370-378, 2007. 

[46] R Durbin, SR Eddy, A Krogh, and G Mitchison. Biological Sequence Analysis: Proba¬ 
bilistic Models of Proteins and Nucleic Acids. Gambridge University Press, Cambridge, 
UK, 1998. 

[47] K Mizuguchi, CM Dean, TL Blundell, and JP Overington. HOMSTRAD: a database of 
protein structure alignments for homologous families. Protein Sci., 7:2469-2471, 1998. 


103 



[48] RA Cartwright. DNA assembly with gaps (Dawg): simulating sequence evolution. 
Bioinformatics., 21 (Suppl. 3);iii31-iii38, 2005. 

[49] W Fletcher and Z Yang. INDELible: A flexible simulator of biological sequence evolu¬ 
tion. Mol Biol EvoL, 26:1879-1888, 2009. 

[50] CL Strope, K Abel, SD Scott, and EN Moriyama. Biological sequence simulation for 
testing complex evolutionary hypothesis: indel-Seq-Gen version 2.0. Mol Biol EvoL, 
26:2581-2593, 2009. 

[51] G hunter, I Miklos, A Drummond, J Jensen, and J Hein. Baysian phylogenetic inference 
under a statistical indel model. Lecture Notes in Bioinformatics., 2812:228-244, 2003. 

[52] G hunter, I Miklos, A Drummond, JL Jensen, and J Hein. Baysian coestimation of 
phylogeny and sequence alignment. BMC Bioinformatics., 6:83, 2005. 

[53] E Levy Karin, H Ashkenazy, J Hein, and T Pupko. A simulation-based approach to 
statistical alignment. Syst Biol, 68:252-266, 2019. 

[54] N De Maio. The cumulative indel model: fast and accurate statistical evolutionary 
alignment. Syst Biol, 2020. (available as E-pub). 

[55] I Holmes. Application of indel evolution by differential calculus of finite state automata, 
available in bioRxiv with doi: 10.1101/2020.06.29.178764., 2020. 

[56] K Ezawa. Alingment Neighborhood EXplorer (ANEX): Eirst attempt to apply gen¬ 
uine sequence evolution model with realistic insertions/deletions to Multiple Sequence 
Alignment reconstruction problem, preprint (KEZW_BLME00006.anex.pdf) available 
at: https://www.bioinformatics.org/ftp/pub/anex/Documents/Preprints/., 2020. 

[57] G hunter. Probabilistic whole-genome alignments reveal high indel rates in the human 
and mouse genomes. Bioinformatics., 23:i289-i296, 2007. 


104 



[58] J Kim and S Sinha. Indelign: a probabilistic framework for annotation of insertions and 
deletions in a mnltiple alignment. Bioinformatics., 23:289-297, 2007. 

[59] J Felsenstein. Evolntionary trees from DNA sequences: a maximum likelihood approach. 
J Mol Evol, 17:368-376, 1981. 

[60] J. Felsenstein. Inferring Phylogenetics. Sinauer, Sunderland, Massachusetts, 2004. 

[61] Z. Yang. Computational Molecular Evolution. Oxford Univ. Press, Oxford, UK, 2006. 

[62] W.H. Press, S.A. Teukolsky, W.T. Vetterling, and B.P. Flannery. Numecial Recipes in 
C: The Art of Scientific Computing (2nd Ed.). Cambridge Univ. Press, Cambridge, 
UK, 1992. 

[63] I Holmes and WJ Bruno. Evolutionary HMMs: a Bayesian approach to multiple align¬ 
ment. Bioinformatics., 17:803-820, 2001. 

[64] I Holmes. Using guide trees to construct multiple-sequence evolutionary HMMs. Bioin¬ 
formatics., 19(Suppl I):il47~il57, 2003. 

[65] C Burge and S Karlin. Prediction of complete gene structures in human genomic dna. 
J Mol Biol, 268:78-94, 1997. 

[66] IH Holmes. Solving the master equation for indels. BMC Bioinformatics., 18:255, 2017. 

[67] K Ezawa. Review of the Commentary: ’’Solving the master equation for indels”. Posted 
on PubPeer (https://pubpeer.com)., 2017. 

[68] M Morgante, E De Paoli, and S Radovic. Transposable elements and the plant pan¬ 
genomics. Curr Opin Plant Biol, 10:1039-1049, 2007. 

[69] D Chalopin, M Navile, F Plard, D Galiana, and JN Vollff. Comparative analysis of trans¬ 
posable elements highlights mobilome diversity and evolution in vertebrates. Cenome 
Biol Evol, 7:567-580, 2015. 


105 



[70] J Ellegren. Microsatellites: simple sequences with complex evolution. Nat Rev Genet., 
5:435-445, 2004. 

[71] R Sainudiin, RT Durrett, CF Aquadro, and R Nielsen. Microsatellite mutation models: 
insights from a comparison of humans and chimpanzees. Genetics., 168:383-395, 2004. 

[72] F Famoud, M Schwartz, and J Bruck. Estimation of duplication history under a stochas¬ 
tic model for tandem repeats. BMG Bioinformatics., 20:64, 2019. 

[73] K Ezawa, D Graur, and G Landan. Perturbative formulation of general continuous-time 
Markov model of sequence evolution via insertions/deletions. Part IV: incorporation of 
substitutions and other mutations, available in bioRxiv with doi: 10.1101/023622., 
2015. 

[74] K Ezawa. Substitutional Residue-Difference Map (SRD Map) to help 

locate mis-alignments in Multiple Sequence Alignment (MSA): to¬ 

ward Artihcial-Intelilgence-assisted probability distribution of alterna¬ 
tive MSAs. preprint (KEZW_BLME00007.srdmap.pdf) available at: 

https://www.bioinformatics.org/ftp/pub/anex/Documents/Preprints/., 2020. 

[75] K Ezawa. (Approximate) Solutions to some technical issues on alignment probability 
calculation under genuine sequence evolution model with realistic insertions/deletions, 
(in preparation.). 


106 



